GitLab.com Database Incident - 2017/01/31
Note: This incident affected the database (including issues and merge requests) but not the Git data (repositories and wikis).
Recovery - 2017/01/31 23:00 (backup from ±17:20 UTC)
Hugops (please add kind reactions here, from twitter and elsewhere)
Timeline (all times UTC)
- 2017/01/31 16:00/17:00 - 21:00
- YP is working on setting up pgpool and replication in staging, creates an LVM snapshot to get up to date production data to staging, hoping he can re-use this for bootstrapping other replicas. This was done roughly 6 hours before data loss.
- Getting replication to work is proving to be problematic and time consuming (estimated at ±20 hours just for the initial pg_basebackup sync). The LVM snapshot is not usable on the other replicas as far as YP could figure out. Work is interrupted due to this (YP needs the help of another colleague who is not working this day) and due to spam/high load on GitLab.com
- 2017/01/31 21:00 - Spike in database load due to spam users - Twitter | Slack
- Blocked users based on IP address
- Removed a user for using a repository as some form of CDN, resulting in 47 000 IPs signing in using the same account (causing high DB load). This was communicated with the infrastructure and support team.
- Removed users for spamming (by creating snippets) - Slack
- Database load goes back to normal, some manual PostgreSQL vacuuming is applied here and there to catch up with a large amount of dead tuples.
- 2017/01/31 22:00 - Replication lag alert triggered in PagerDuty - Slack
- Attempts to fix db2, it’s lagging behind by about 4 GB at this point
- db2.cluster refuses to replicate, /var/opt/gitlab/postgresql/data is wiped to ensure a clean replication
- db2.cluster refuses to connect to db1, complaining about max_wal_senders being too low. This setting is used to limit the number of WAL (= replication) clients
- YP adjusts max_wal_senders to 32 on db1, restarts PostgreSQL
- PostgreSQL complains about too many semaphores being open, refusing to start
- YP adjusts max_connections to 2000 from 8000, PostgreSQL starts again (despite 8000 having been used for almost a year)
- db2.cluster still refuses to replicate, though it no longer complains about connections; instead it just hangs there not doing anything
- At this point frustration begins to kick in. Earlier this night YP explicitly mentioned he was going to sign off as it was getting late (23:00 or so local time), but didn’t due to the replication problems popping up all of a sudden.
- 2017/01/31 23:00-ish
- YP thinks that perhaps pg_basebackup is being super pedantic about there being an empty data directory, decides to remove the directory. After a second or two he notices he ran it on db1.cluster.gitlab.com, instead of db2.cluster.gitlab.com
- 2017/01/31 23:27 - YP terminates the removal, but it’s too late. Of around 310 GB only about 4.5 GB is left - Slack
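The link between max_connections and the semaphore error in the timeline can be sketched with a little arithmetic. This is a rough estimate only, based on the formula in the PostgreSQL 9.6 kernel resources documentation (semaphores are allocated in sets of 16, each with one extra). The values for autovacuum_max_workers and max_worker_processes below are the stock defaults and are an assumption, since the notes do not record what db1 actually used. It does not explain why 8000 worked for almost a year; it only shows how much the requirement shrinks at 2000.

```python
import math


def semaphore_needs(max_connections, autovacuum_max_workers=3, max_worker_processes=8):
    """Estimate (semaphore sets, total semaphores) PostgreSQL 9.6 requests at startup.

    The two worker defaults are assumptions, not values taken from db1.
    """
    slots = max_connections + autovacuum_max_workers + max_worker_processes + 5
    sets = math.ceil(slots / 16)   # semaphores are allocated in sets of 16
    return sets, sets * 17         # each set carries one additional semaphore


for conns in (8000, 2000):
    sets, sems = semaphore_needs(conns)
    print(f"max_connections={conns}: {sets} sets, {sems} semaphores")
```

These numbers would need to be compared with the kernel's semaphore limits on the host (for example via `ipcs -ls` on Linux) to see which limit was hit.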
Recovery - 2017/01/31 23:00 (backup from ±17:20 UTC)
- Suggested recovery solutions:
- Migrate db1.staging.gitlab.com data to GitLab.com (±6 hours old)
- CW: Problem with web hooks, these are removed as part of the staging sync.
- Restore LVM snapshot (6 hours old)
- Sid: try to undelete files?
- CW: Not possible! `rm -Rvf` Sid: OK
- JEJ: Probably too late, but isn't it sometimes possible if you make the disk read-only quickly enough? Also might still have file descriptor if the file was in use by a running process according to http://unix.stackexchange.com/a/101247/213510
- YP: PostgreSQL doesn't keep all files open at all times, so that wouldn't work. Also, Azure is apparently really good at removing data quickly, but not at sending it over to replicas. In other words, the data can't be recovered from the disk itself.
- SH: It appears the db1 staging server runs a separate PostgreSQL process under the gitlab_replicator directory that streams production data from db2. Due to replication lag, db2 was killed 2017-01-31 05:53, which caused the gitlab_replicator to stop. The good news is that the data up until that point looks unaltered, so we may be able to recover the WebHook data.
- Action taken:
- 2017/01/31 23:00 - 00:00: The decision is made to restore data from db1.staging.gitlab.com to db1.cluster.gitlab.com (production). While 6 hours old and without webhooks, it’s the only available snapshot. YP says it’s best for him not to run anything with sudo any more today, handing off the restoring to JN.
- 2017/02/01 00:36 - JN: Backup db1.staging.gitlab.com data
- 2017/02/01 00:55 - JN: Mount db1.staging.gitlab.com on db1.cluster.gitlab.com
- Copy data from staging /var/opt/gitlab/postgresql/data/ to production /var/opt/gitlab/postgresql/data/
- 2017/02/01 01:05 - JN: nfs-share01 server commandeered as temp storage place in /var/opt/gitlab/db-meltdown
- 2017/02/01 01:18 - JN: Copy of remaining production data, including pg_xlog tar’ed up as ‘20170131-db-meltodwn-backup.tar.gz’
- 2017/02/01 01:58 - JN: Start rsync from stage to production
- 2017/02/01 02:00 - CW: Updated deploy page to explain the situation. Link
- 2017/02/01 03:00 - AR: rsync progress approximately 50% (by # of files)
- 2017/02/01 04:00 - JN: rsync progress approximately 56.4% (by # of files). Data transfer is slowed by two factors: network I/O between us-east and us-east-2 and disk throughput cap on staging server (60 Mb/s).
- 2017/02/01 07:00 - JN: Found a copy of pre-sanitized data on db1 staging in /var/opt/gitlab_replicator/postgresql. Started db-crutch VM in us-east to back up this data to another host. Unfortunately, this system maxes out at 120 GB RAM and cannot support the production load. This copy will be used to check the database state and export the WebHook data.
- 2017/02/01 08:07 - JN: Data transfer has been slow: total transfer progress by data size is 42%.
- Restore procedure
- Upgrade db1.cluster.gitlab.com to PostgreSQL 9.6.1 as it’s still running 9.6.0 while staging uses 9.6.1 (PostgreSQL might not start otherwise)
- Start the DB
- Update the Sentry DSN
- Attempt to restore webhooks, if possible
- Flush Rails/Redis cache
- Gradually start workers
- Disable deploy page
- Remove the spam users again (so they don’t cause problems again)
- TODO after data restored:
- Remove the users we removed earlier today due to spam/abuse.
- Create outage issue
- Create issue to change terminal PS1 format/colours to make it clear whether you’re using production or staging (red production, yellow staging)
- Show the full hostname in the bash prompt for all users by default (e.g., “db1.staging.gitlab.com” instead of just “db1”)
- Somehow disallow rm -rf for the PostgreSQL data directory? Unsure if this is feasible, or necessary once we have proper backups
- Add alerting for backups: check S3 storage etc.
- Consider adding a last successful backup time in DB so admins can see this easily (suggested by customer in https://gitlab.zendesk.com/agent/tickets/58274)
- Figure out why PostgreSQL suddenly had problems with max_connections being set to 8000, despite it having been set to that since 2016-05-13. A large portion of frustration arose because of this suddenly becoming a problem.
- Upgrade dbX.cluster to PostgreSQL 9.6.1 as it’s still running the pinned 9.6.0 package (used for the Slony upgrade from 9.2 to 9.6.0)
- Flush Redis cache once the DB has been restored
- Add server hostname to bash PS1 (avoid running commands on the wrong host)
- Look into increasing replication thresholds via WAL archiving
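One way the "wrong host" and "disallow rm -rf" items above could be approached is a small wrapper that refuses to delete anything unless the operator names the host they believe they are on. The sketch below is illustrative only: `remove_data_dir` and `WrongHostError` are hypothetical names, not existing GitLab tooling, and the function defaults to a dry run.

```python
import shutil
import socket
from pathlib import Path


class WrongHostError(RuntimeError):
    """Raised when a destructive command targets a host other than the current one."""


def remove_data_dir(path, expected_host, current_host=None, dry_run=True):
    """Remove `path` only if this machine's FQDN equals `expected_host`.

    `current_host` exists so the check can be exercised without touching a real server.
    """
    current = current_host if current_host is not None else socket.getfqdn()
    if current != expected_host:
        raise WrongHostError(
            f"refusing to remove {path}: running on {current}, expected {expected_host}"
        )
    if dry_run:
        return f"would remove {path} on {current}"
    shutil.rmtree(Path(path))
    return f"removed {path} on {current}"


# Simulates the incident: the operator intends db2 but the shell is on db1.
try:
    remove_data_dir(
        "/var/opt/gitlab/postgresql/data",
        expected_host="db2.cluster.gitlab.com",
        current_host="db1.cluster.gitlab.com",
    )
except WrongHostError as err:
    print(err)
```

A guard like this complements, rather than replaces, showing the full hostname in the prompt: the prompt helps the operator notice, while the guard fails closed when they do not.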
Problems Encountered
- LVM snapshots are by default only taken once every 24 hours. YP happened to run one manually about 6 hours prior to the outage
- Regular backups seem to also only be taken once per 24 hours, though YP has not yet been able to figure out where they are stored. According to JN these don’t appear to be working, producing files only a few bytes in size.
- SH: It looks like pg_dump may be failing because PostgreSQL 9.2 binaries are being run instead of 9.6 binaries. This happens because omnibus only uses Pg 9.6 if data/PG_VERSION is set to 9.6, but on workers this file does not exist. As a result it defaults to 9.2, failing silently. No SQL dumps were made as a result. Fog gem may have cleaned out older backups.
- Disk snapshots in Azure are enabled for the NFS server, but not for the DB servers.
- The synchronisation process removes webhooks once it has synchronised data to staging. Unless we can pull these from a regular backup from the past 24 hours they will be lost
- The replication procedure is super fragile, prone to error, relies on a handful of random shell scripts, and is badly documented
- SH: We learned later that the staging DB refresh works by taking a snapshot of the gitlab_replicator directory, pruning the replication configuration, and starting up a separate PostgreSQL server.
- Our backups to S3 apparently don’t work either: the bucket is empty
- We don’t have solid alerting/paging for when backups fail; we are seeing this on the dev host too now.
So in other words, out of 5 backup/replication techniques deployed, none are working reliably or were set up in the first place. As a result, we're now restoring a backup from 6 hours ago that worked.
http://monitor.gitlab.net/dashboard/db/postgres-stats?panelId=10&fullscreen&from=now-24h&to=now
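Two of the failures listed above (dump files only a few bytes in size, and nobody being alerted) are the kind of thing a periodic sanity check could catch. The following is a minimal sketch, assuming backups land as files in a local directory; the function name, the thresholds, and the directory layout are all assumptions for illustration, and checking an S3 bucket would need a different listing call.

```python
import time
from pathlib import Path


def check_latest_backup(backup_dir, max_age_hours=24, min_size_bytes=1_000_000, now=None):
    """Return a list of problems with the newest file in `backup_dir` (empty list = healthy)."""
    now = time.time() if now is None else now
    files = [p for p in Path(backup_dir).iterdir() if p.is_file()]
    if not files:
        return ["no backup files found"]

    latest = max(files, key=lambda p: p.stat().st_mtime)
    stat = latest.stat()
    problems = []

    age_hours = (now - stat.st_mtime) / 3600
    if age_hours > max_age_hours:
        problems.append(f"{latest.name} is {age_hours:.1f} hours old")
    # A dump of a few bytes usually means the dump command failed silently.
    if stat.st_size < min_size_bytes:
        problems.append(f"{latest.name} is only {stat.st_size} bytes")
    return problems
```

The returned list could be fed into whatever paging system is in use; an empty directory is reported as a problem rather than treated as "nothing to check".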
External help
Hugops (please add kind reactions here, from twitter and elsewhere)
Stephen Frost
https://twitter.com/net_snow/status/826622954964393984 @gitlabstatus hey, I'm a PG committer, major contributor, and love what you all do. HMU if I can help in any way, I'd be happy to help.
Sam McLeod
Hey Sid, Sorry to hear about your database / LVM issue, bugger of a thing to happen. Hey we run quite a number of PostgreSQL clusters (master/slave) and I noticed a few things in your report.
1. You're using Slony - that thing is a flaming piece of shit, not an understatement even have a laugh at it by following http://howfuckedismydatabase.com , PostgreSQL's inbuilt binary streaming replication however is rock solid and very fast, I suggest switching to that.
2. No mention of a connection pooler and mention of having thousands of connections set in postgresql.conf - this is really bad and very inefficient for performance, I suggest using pgbouncer as a connection pooler - https://pgbouncer.github.io/ and not setting PostgreSQL's max_connections over 512-1024, realistically if you're using more than 256 active connections - you need to scale out not up.
3. The report mentions how fragile your failover and backup processes are, we wrote a simple script for postgresql failover and documentation to go with it - would you like me to provide you with it? As far as backups - we use pgbarman to perform many incremental backups during the day and full backups twice daily both via barman and postgresql's pg_dump command, it's important to have your backup directory on different storage from your postgresql data for both performance and resiliency / portability.
4. You're still on Azure?!?! I'd suggest getting off that crudbucket, so many internal DNS, NTP, routing and storage IO issues with Microsoft's platform it's ridiculous, I've heard some horror stories of how it's held together internally too.
Let me know if you'd like any more advice on tuning PostgreSQL, I've had a lot of experience with it.
Capt. McLeod
also - question - how big is your database(s) on disk? like are we talking TB here or still in the GB?
7 hours ago
Capt. McLeod
open sourced my failover / replication sync script:
7 hours ago
Also - I see you're looking at pgpool - I would not suggest that, look at pgbouncer instead
Capt. McLeod
Pgpool has lots of problems, we tested it thoroughly and then binned it
5 hours ago
Capt. McLeod
Also, let me know if there's anything I can say publicly on twitter or whatever to support GitLab and your transparency through this event, I know how shit these things are, we had SAN level split brain at infoxchange when I first started and I was so nervous I was vomiting!
4 hours ago
Sid Sijbrandij
Hi Sam, thanks for all the help. Mind if I paste it in a public document to share with the rest of the team?
3 minutes ago
Capt. McLeod
The failover script?
2 minutes ago
Sid Sijbrandij
Everything you messaged.
1 minute ago
Sure, it's a public repo anyway, but yeah I'm not saying it's perfect - far from but it does work really reliable, I fail between hosts all the time without issue, but YMMV etc etc
Yeah absolute re: other recommendations too
If you can send me information about your VM that has PostgreSQL on it and your PostgreSQL.conf file I can make comments on any changes / concerns and explain each one
Notes regarding the above:
- Slony was only used for upgrading from 9.2 to 9.6, we use streaming replication for our regular replication needs.
- Rails already pools/re-uses connections, with 25 connections per process. With 20-ish processes for 20 hosts this produces 10 000 max connections, though there will only be 400-ish active concurrently (as Unicorn is single threaded)
- For load balancing, pooling, better failover, etc, we’re looking into pgpool + streaming replication, with synchronous commits (for data consistency). Pgbouncer does not do load balancing (at least out of the box) as far as we’re aware, only connection pooling. https://github.com/awslabs/pgbouncer-rr-patch might also be an option.
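The connection figures in the notes above can be checked with a short calculation. This sketch reads "20-ish processes for 20 hosts" as roughly 20 Unicorn processes on each of 20 hosts, which is an interpretation rather than something the notes state outright; the function names are illustrative.

```python
def max_pooled_connections(connections_per_process, processes_per_host, hosts):
    """Upper bound on connections if every Rails pool were completely full."""
    return connections_per_process * processes_per_host * hosts


def max_active_connections(processes_per_host, hosts):
    """Unicorn is single threaded, so each process uses at most one connection at a time."""
    return processes_per_host * hosts


print(max_pooled_connections(25, 20, 20))  # ceiling that max_connections has to cover
print(max_active_connections(20, 20))      # connections that can be busy concurrently
```

The gap between the two numbers is the argument for a pooler in front of PostgreSQL: the server is sized for a ceiling that is about 25 times the number of connections that can be busy at once.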