
GitLab Database Incident Report


  GitLab.com Database Incident - 2017/01/31

Note: This incident affected the database (including issues and merge requests) but not the Git repositories and wikis.

  • Timeline (all times UTC)
  • Recovery - 2017/01/31 23:00 (backup from ±17:20 UTC)
  • Problems Encountered
  • External help
     • Hugops (please add kind reactions here, from twitter and elsewhere)
        • Stephen Frost
        • Sam McLeod

Timeline (all times UTC)

  1. 2017/01/31 16:00/17:00 - 21:00
     1. YP is working on setting up pgpool and replication in staging, and creates an LVM snapshot to get up-to-date production data into staging, hoping to re-use it for bootstrapping the other replicas. This was done roughly 6 hours before the data loss.
     2. Getting replication to work is proving problematic and time consuming (estimated at ±20 hours just for the initial pg_basebackup sync). The LVM snapshot could not be used on the other replicas as far as YP could figure out. Work is interrupted because of this (YP needed the help of another colleague who wasn't working that day) and because of spam/high load on GitLab.com.
  2. 2017/01/31 21:00 - Spike in database load due to spam users - Twitter | Slack
     1. Blocked users based on IP address.
     2. Removed a user for using a repository as some form of CDN, which had resulted in 47 000 IPs signing in with the same account (causing high DB load). This was communicated to the infrastructure and support teams.
     3. Removed users for spamming (by creating snippets) - Slack.
     4. Database load goes back to normal; some manual PostgreSQL vacuuming is applied here and there to catch up with a large amount of dead tuples.
  3. 2017/01/31 22:00 - Replication lag alert triggered in PagerDuty - Slack
     1. Attempts to fix db2; it's lagging behind by about 4 GB at this point.
     2. db2.cluster refuses to replicate; /var/opt/gitlab/postgresql/data is wiped to ensure a clean replication (a sketch of this kind of re-initialisation is included after the timeline).
     3. db2.cluster refuses to connect to db1, complaining about max_wal_senders being too low. This setting limits the number of WAL (i.e. replication) clients.
     4. YP adjusts max_wal_senders to 32 on db1 and restarts PostgreSQL.
     5. PostgreSQL complains about too many semaphores being open and refuses to start.
     6. YP lowers max_connections from 8000 to 2000, and PostgreSQL starts again (despite 8000 having been in use for almost a year).
     7. db2.cluster still refuses to replicate; it no longer complains about connections, it just hangs there doing nothing.
     8. At this point frustration begins to kick in. Earlier this night YP explicitly mentioned he was going to sign off as it was getting late (23:00 or so local time), but didn't because of the replication problems popping up all of a sudden.
  4. 2017/01/31 23:00-ish
     1. YP thinks that perhaps pg_basebackup is being overly pedantic about the data directory being empty and decides to remove the directory. After a second or two he notices he ran it on db1.cluster.gitlab.com instead of db2.cluster.gitlab.com.
     2. 2017/01/31 23:27 - YP terminates the removal, but it's too late. Of around 310 GB only about 4.5 GB is left - Slack
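
A minimal sketch of the kind of replica re-initialisation attempted in the 22:00 window above, assuming the omnibus gitlab-ctl wrapper and the paths named in the timeline; the exact commands run that night are not recorded here, and the hostname guard reflects the lesson of the 23:00 step.

    # On the primary (db1): raise the limits in postgresql.conf, e.g.
    #   max_wal_senders = 32
    #   max_connections = 2000
    # then restart PostgreSQL via the omnibus wrapper:
    sudo gitlab-ctl restart postgresql

    # On the replica (db2) only - guard against running this on the wrong host:
    [ "$(hostname -f)" = "db2.cluster.gitlab.com" ] || { echo "wrong host"; exit 1; }
    sudo gitlab-ctl stop postgresql
    sudo rm -rf /var/opt/gitlab/postgresql/data/*
    # Re-seed from the primary (a replication role, e.g. -U gitlab_replicator,
    # would normally be needed as well):
    sudo -u gitlab-psql /opt/gitlab/embedded/bin/pg_basebackup \
      -h db1.cluster.gitlab.com -D /var/opt/gitlab/postgresql/data -X stream -P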

Recovery - 2017/01/31 23:00 (backup from ±17:20 UTC)

  1. Suggested recovery solutions:
     1. Migrate db1.staging.gitlab.com data to GitLab.com (±6 hours old).
        1. CW: Problem with webhooks - these are removed as part of the staging sync.
     2. Restore the LVM snapshot (6 hours old).
     3. Sid: try to undelete the files?
        1. CW: Not possible! `rm -Rvf` Sid: OK
        2. JEJ: Probably too late, but isn't it sometimes possible if you make the disk read-only quickly enough? A file descriptor may also still be open if the file was in use by a running process, according to http://unix.stackexchange.com/a/101247/213510 (see the sketch just after these suggestions).
        3. YP: PostgreSQL doesn't keep all files open at all times, so that wouldn't work. Also, Azure is apparently really good at removing data quickly, but not at sending it over to replicas. In other words, the data can't be recovered from the disk itself.
        4. SH: It appears the db1 staging server runs a separate PostgreSQL process under the gitlab_replicator directory that streams production data from db2. Due to replication lag, db2 was killed 2016-01-31 05:53, which caused the gitlab_replicator to stop. The good news is that the data up until that point looks unaltered, so we may be able to recover the WebHook data.
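
To make JEJ's file-descriptor point concrete, a rough sketch of that kind of check follows; the process and user names are assumptions, and as YP notes it would not have helped here because PostgreSQL does not keep all of its data files open.

    # If a process still holds a deleted file open, the contents can sometimes be
    # copied back out of /proc before the process exits.
    pid=$(pgrep -o -u gitlab-psql postgres)   # assumed PostgreSQL user/process name
    ls -l "/proc/$pid/fd" | grep '(deleted)'
    # A still-open descriptor N could then be rescued with:
    #   cp "/proc/$pid/fd/N" /some/safe/location
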
  2. Action taken:
     1. 2017/01/31 23:00 - 2017/02/01 00:00: The decision is made to restore data from db1.staging.gitlab.com to db1.cluster.gitlab.com (production). While 6 hours old and without webhooks, it's the only available snapshot. YP says it's best for him not to run anything with sudo any more today, and hands the restore off to JN.
     2. 2017/02/01 00:36 - JN: Back up db1.staging.gitlab.com data.
     3. 2017/02/01 00:55 - JN: Mount db1.staging.gitlab.com on db1.cluster.gitlab.com.
        1. Copy data from staging /var/opt/gitlab/postgresql/data/ to production /var/opt/gitlab/postgresql/data/ (a rough reconstruction of the copy commands follows this list).
     4. 2017/02/01 01:05 - JN: nfs-share01 server commandeered as a temporary storage place in /var/opt/gitlab/db-meltdown.
     5. 2017/02/01 01:18 - JN: Copy of the remaining production data, including pg_xlog, tar'ed up as '20170131-db-meltodwn-backup.tar.gz'.
     6. 2017/02/01 01:58 - JN: Start rsync from staging to production.
     7. 2017/02/01 02:00 - CW: Updated the deploy page to explain the situation. Link
     8. 2017/02/01 03:00 - AR: rsync progress approximately 50% (by number of files).
     9. 2017/02/01 04:00 - JN: rsync progress approximately 56.4% (by number of files). Data transfer is slowed by two factors: network I/O between us-east and us-east-2, and the disk throughput cap on the staging server (60 Mb/s).
     10. 2017/02/01 07:00 - JN: Found a copy of pre-sanitized data on db1 staging in /var/opt/gitlab_replicator/postgresql. Started the db-crutch VM in us-east to back up this data to another host. Unfortunately this system maxes out at 120 GB RAM and cannot support the production load. This copy will be used to check the database state and export the WebHook data.
     11. 2017/02/01 08:07 - JN: Data transfer has been slow: total transfer progress by data size is at 42%.
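
A hedged reconstruction of the 01:18 and 01:58 copy steps above; the exact flags are not in this report, the archive name is the one recorded in the timeline, and /mnt/staging is a hypothetical mount point for the staging filesystem mounted at 00:55.

    # Archive what is left of the production data directory (including pg_xlog)
    # into the commandeered db-meltdown area:
    sudo tar -czf /var/opt/gitlab/db-meltdown/20170131-db-meltodwn-backup.tar.gz \
      -C /var/opt/gitlab/postgresql data

    # Pull the staging copy into the production data directory:
    sudo rsync -a --info=progress2 \
      /mnt/staging/var/opt/gitlab/postgresql/data/ \
      /var/opt/gitlab/postgresql/data/
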
  3. Restore procedure (a rough sketch of the corresponding commands follows this list):
     1. Upgrade db1.cluster.gitlab.com to PostgreSQL 9.6.1, as it's still running 9.6.0 while staging uses 9.6.1 (PostgreSQL might not start otherwise).
     2. Start the DB.
     3. Update the Sentry DSN.
     4. Attempt to restore the webhooks, if possible.
     5. Flush the Rails/Redis cache.
     6. Gradually start the workers.
     7. Disable the deploy page.
     8. Remove the spam users again (so they don't cause problems again).
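
A minimal sketch of what steps 2 and 5-7 could look like with the omnibus tooling; the actual runbook commands are not part of this document.

    sudo gitlab-ctl start postgresql    # step 2: start the DB
    sudo gitlab-rake cache:clear        # step 5: flush the Rails/Redis cache
    sudo gitlab-ctl start sidekiq       # step 6: gradually start workers
    sudo gitlab-ctl start unicorn
    sudo gitlab-ctl deploy-page down    # step 7: take the deploy page down again
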
  4. TODO after the data is restored:
     1. Remove the users we removed earlier today due to spam/abuse.
     2. Create an outage issue.
     3. Create an issue to change the terminal PS1 format/colours to make it clear whether you're using production or staging (red for production, yellow for staging); see the sketch after this list.
     4. Show the full hostname in the bash prompt for all users by default (e.g. "db1.staging.gitlab.com" instead of just "db1").
     5. Somehow disallow rm -rf on the PostgreSQL data directory? Unsure if this is feasible, or necessary once we have proper backups.
     6. Add alerting for backups: check S3 storage etc.
     7. Consider adding a last successful backup time to the DB so admins can see this easily (suggested by a customer in https://gitlab.zendesk.com/agent/tickets/58274).
     8. Figure out why PostgreSQL suddenly had problems with max_connections being set to 8000, despite it having been set to that since 2016-05-13. A large part of the frustration arose because this suddenly became a problem.
     9. Upgrade dbX.cluster to PostgreSQL 9.6.1, as it's still running the pinned 9.6.0 package (used for the Slony upgrade from 9.2 to 9.6.0).
     10. Flush the Redis cache once the DB has been restored.
     11. Add the server hostname to the bash PS1 (to avoid running commands on the wrong host).
     12. Look into increasing replication thresholds via WAL archiving.
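
An illustrative sketch for TODO items 3, 4 and 11: a bash prompt that always shows the full hostname and colours it by environment. The hostname patterns are assumptions based on the names used in this report.

    # e.g. in /etc/profile.d/prompt.sh on every DB host
    case "$(hostname -f)" in
      *.cluster.gitlab.com) host_colour='\[\e[1;31m\]' ;;  # production: red
      *.staging.gitlab.com) host_colour='\[\e[1;33m\]' ;;  # staging: yellow
      *)                    host_colour='' ;;
    esac
    # \u = user, \H = full hostname, \w = working directory
    export PS1="${host_colour}\u@\H\[\e[0m\]:\w"'\$ '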

Problems Encountered

  1. LVM snapshots are by default only taken once every 24 hours. YP happened to run one manually about 6 hours prior to the outage.
  2. Regular backups also seem to only be taken once per 24 hours, though YP has not yet been able to figure out where they are stored. According to JN these don't appear to be working, producing files only a few bytes in size.
     1. SH: It looks like pg_dump may be failing because the PostgreSQL 9.2 binaries are being run instead of the 9.6 binaries. This happens because omnibus only uses Pg 9.6 if data/PG_VERSION is set to 9.6, but on workers this file does not exist, so it defaults to 9.2 and fails silently. As a result no SQL dumps were made. The Fog gem may have cleaned out older backups.
  3. Disk snapshots in Azure are enabled for the NFS server, but not for the DB servers.
  4. The synchronisation process removes webhooks once it has synchronised data to staging. Unless we can pull them out of a regular backup from the past 24 hours, they will be lost.
  5. The replication procedure is super fragile, prone to error, relies on a handful of random shell scripts, and is badly documented.
     1. SH: We later learned that the staging DB refresh works by taking a snapshot of the gitlab_replicator directory, pruning the replication configuration, and starting up a separate PostgreSQL server.
  6. Our backups to S3 apparently don't work either: the bucket is empty.
  7. We don't have solid alerting/paging for when backups fail; we are now seeing this on the dev host too (a sketch of a simple sanity check follows this list).
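
A hedged sketch of the kind of backup sanity check items 2 and 7 call for; this is not an existing GitLab script, and the backup path and size threshold are assumptions.

    #!/bin/bash
    set -euo pipefail

    PGDATA=/var/opt/gitlab/postgresql/data
    BACKUP_DIR=/var/opt/gitlab/backups        # assumed backup location

    # pg_dump silently falls back to the 9.2 binaries when PG_VERSION is missing
    # (see SH's note above), so treat its absence as a failure in itself:
    [ -f "$PGDATA/PG_VERSION" ] || { echo "CRITICAL: $PGDATA/PG_VERSION missing"; exit 2; }

    # A dump of a few bytes is a failed dump, not a backup:
    latest=$(ls -t "$BACKUP_DIR"/*.tar 2>/dev/null | head -n 1 || true)
    if [ -z "$latest" ] || [ "$(stat -c%s "$latest")" -lt 1048576 ]; then
      echo "CRITICAL: latest backup missing or smaller than 1 MB: ${latest:-none}"
      exit 2
    fi
    echo "OK: $latest ($(stat -c%s "$latest") bytes)"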

So in other words: out of the five backup/replication techniques deployed, none were working reliably or had even been set up in the first place => we're now restoring a backup from 6 hours ago that did work.

http://monitor.gitlab.net/dashboard/db/postgres-stats?panelId=10&fullscreen&from=now-24h&to=now 

External help

Hugops (please add kind reactions here, from twitter and elsewhere)

Stephen Frost

https://twitter.com/net_snow/status/826622954964393984 @gitlabstatus hey, I'm a PG committer, major contributor, and love what you all do. HMU if I can help in any way, I'd be happy to help.

Sam McLeod

Hey Sid, sorry to hear about your database / LVM issue - bugger of a thing to happen. We run quite a number of PostgreSQL clusters (master/slave) and I noticed a few things in your report:

  1. You're using Slony - that thing is a flaming piece of shit, not an understatement; even have a laugh at it by following http://howfuckedismydatabase.com. PostgreSQL's inbuilt binary streaming replication, however, is rock solid and very fast - I suggest switching to that.
  2. No mention of a connection pooler, and mention of having thousands of connections set in postgresql.conf - this is really bad and very inefficient for performance. I suggest using pgbouncer as a connection pooler - https://pgbouncer.github.io/ - and not setting PostgreSQL's max_connections over 512-1024. Realistically, if you're using more than 256 active connections you need to scale out, not up.
  3. The report mentions how fragile your failover and backup processes are. We wrote a simple script for PostgreSQL failover and documentation to go with it - would you like me to provide you with it? As for backups, we use pgbarman to perform many incremental backups during the day and full backups twice daily, both via barman and PostgreSQL's pg_dump command. It's important to have your backup directory on different storage from your PostgreSQL data, for both performance and resiliency / portability.
  4. You're still on Azure?!?! I'd suggest getting off that crudbucket - so many internal DNS, NTP, routing and storage IO issues with Microsoft's platform, it's ridiculous. I've heard some horror stories of how it's held together internally too.

Let me know if you'd like any more advice on tuning PostgreSQL, I've had a lot of experience with it.

 Capt. McLeod

 also - question - how big is your database(s) on disk? like are we talking TB here or still in the GB?

7 hours ago

 Capt. McLeod

 open sourced my failover / replication sync script:

7 hours ago

Also - I see you're looking at pgpool - I would not suggest that, look at pgbouncer instead

 Capt. McLeod

 Pgpool has lots of problems, we tested it thoroughly and then binned it

5 hours ago

 Capt. McLeod

 Also, let me know if there's anything I can say publicly on Twitter or whatever to support GitLab and your transparency through this event. I know how shit these things are - we had SAN-level split brain at Infoxchange when I first started, and I was so nervous I was vomiting!

4 hours ago

 Sid Sijbrandij

 Hi Sam, thanks for all the help. Mind if I paste it in a public document to share with the rest of the team?

3 minutes ago

 Capt. McLeod

 The failover script?

2 minutes ago

 Sid Sijbrandij

 Everything you messaged.

1 minute ago

Sure, it's a public repo anyway, but yeah, I'm not saying it's perfect - far from it - but it does work really reliably. I fail over between hosts all the time without issue, but YMMV etc.

Yeah, absolutely re: the other recommendations too.

If you can send me information about the VM that PostgreSQL runs on, along with your postgresql.conf file, I can comment on any changes / concerns and explain each one.

Notes regarding the above:

  • Slony was only used for upgrading from 9.2 to 9.6; we use streaming replication for our regular replication needs.
  • Rails already pools/re-uses connections, with 25 connections per process. With 20-ish processes per host across 20 hosts this produces a theoretical maximum of 10 000 connections, though only around 400 will be active concurrently (as Unicorn is single-threaded). See the arithmetic sketch below.
  • For load balancing, pooling, better failover, etc. we're looking into pgpool + streaming replication with synchronous commits (for data consistency). Pgbouncer does not do load balancing (at least not out of the box) as far as we're aware, only connection pooling. https://github.com/awslabs/pgbouncer-rr-patch might also be an option.
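
A quick back-of-the-envelope check of the connection figures in the second bullet; the numbers come from the bullet itself, only the arithmetic is added.

    per_process=25   # Rails connection pool size per Unicorn process
    processes=20     # Unicorn processes per host (approximate)
    hosts=20
    echo "theoretical maximum: $(( per_process * processes * hosts ))"  # 10 000
    # Unicorn is single-threaded, so roughly one active connection per process:
    echo "active concurrently:  ~$(( processes * hosts ))"              # ~400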
