The AWS and MongoDB Infrastructure of Parse: Lessons Learned


This is the extended form of a comment that got some interest on Hacker News. After a grace period of one year, Parse is now offline. This is a collection of lessons and technical decisions that might be useful for other companies running cloud services. At the very least, they directly affect the design of our own Backend-as-a-Service, Baqend.

So here are some lesser-known or unpublished facts and trivia that I collected by talking to Parse engineers who now work at Facebook. As I am unsure whether they were allowed to share this information, I will not mention them by name.

Users and Traction

  • 1 million apps were deployed to Parse.
  • The largest Parse app had 40M users.
  • Parse was the world’s largest MongoDB user
  • Clash of Kings used Parse for push notifications and made up roughly half of all pushes that went through Parse. They never moved any other parts to Parse, due to scalability concerns.
  • Facebook's original reason for acquiring Parse was to push its mobile SDKs and to create synergies with mobile ads. Parse was often sold as a package deal with Facebook advertising.
  • The static pricing model, measured in guaranteed requests per second, did not work well.
  • Business problem: people tended to remain in the free tier.
  • Technical problem I: complicated rate limiting. If the limit was exceeded by a factor of 60 for a minute, requests were dropped. Limits were tracked using a shared Memcache instance. Consequence: when developers hit rate limits in the API, they added retries, and those retries put enormous load on the Parse backend.
  • Technical problem II: the real bottleneck was almost always the shared MongoDB database cluster, not the API servers.
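
The rate-limiting feedback loop described above can be sketched roughly as follows. This is a minimal illustration, not Parse's implementation: a plain dictionary stands in for the shared Memcache instance, the per-minute budget is one assumed reading of "a factor of 60 for a minute", and the names `FixedWindowLimiter` and `backoff_delays` are hypothetical. The client helper shows capped exponential backoff with jitter, a common way to keep retries from multiplying the load on a backend that is already throttling.

```python
import random


class FixedWindowLimiter:
    """Per-app request counter over fixed one-minute windows.

    A dict stands in for the shared Memcache instance (assumption).
    """

    def __init__(self, requests_per_second, window_seconds=60):
        self.window_seconds = window_seconds
        # Assumed reading of the text: the per-second plan limit is
        # enforced as a budget over a whole minute.
        self.budget = requests_per_second * window_seconds
        # (app_id, window index) -> request count; a real cache would
        # expire old windows, this sketch keeps them forever.
        self.counters = {}

    def allow(self, app_id, now):
        """Return True if the request at time `now` (seconds) is admitted."""
        key = (app_id, int(now // self.window_seconds))
        count = self.counters.get(key, 0) + 1
        self.counters[key] = count
        return count <= self.budget


def backoff_delays(max_retries, base=0.5, cap=30.0, rng=random):
    """Capped exponential backoff with full jitter, one delay per retry."""
    return [rng.uniform(0, min(cap, base * 2 ** attempt))
            for attempt in range(max_retries)]
```

A client that sleeps for the delays from `backoff_delays` between attempts spreads its retries out over time, whereas an immediate retry loop sends every rejected request straight back to the limiter and the backend behind it.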

Parse Server

  • The server was initially written in Rails (with a maximum concurrency of 24 threads) and achieved very little throughput per server (~15–30 requests per second).
  • The server was later rewritten in Go. The open-source Parse Server is written in Node.js and lacks many features of the original Go-based Parse server.
  • Backend was completely on Amazon Web Services
  • It was planned to migrate Parse to Facebook’s infrastructure (e.g. Haystack, Tao, F4, Extended Apache Giraph, Gorilla) but the project was abandoned
  • Roughly 8 developers working on SDKs, 8 on the server, 8 DevOps + a few more engineers

Database

  • >40 MongoDB Replica Sets with 3 nodes each
  • Storage engine: RocksDB (i.e. MongoRocks), an append-only engine based on log-structured merge trees (similar to e.g. Cassandra, HBase, CouchDB, LevelDB, WiredTiger, TokuDB). Reason: it handles many collections better, in contrast to WiredTiger, which uses one file per collection. Compression was better by a factor of 2–3 in terms of space. Writes and replication were also more efficient in terms of latency and lag. The move from MMap to MongoRocks was done by adding a MongoRocks replica that was later promoted to master.
  • Used only instance storage with SSDs, no EBS.
  • No sharding: each tenant was mapped statically to exactly one replica set using MongoDB’s primary database logic.
  • The MongoDB write concern was 1 (!), i.e. writes were acknowledged before they were replicated. Some people complained about lost data and stale reads.
  • Slave reads were allowed for performance reasons.
  • Partial updates were problematic, as small updates to large documents caused “write amplification” when written to the oplog.
  • Frequent (daily) master re-elections on AWS EC2. Rollback files were discarded, which led to data loss.
  • Developed a special “flashback” tool that recorded workloads that could later be rerun for internal load and functional testing
  • JavaScript ran in a forked V8 engine to enforce a 15-second execution limit for user-provided code
  • No sharding automation: manual, error-prone process for largest customers
  • Indexing not exposed: automatic rule-based generation from slow query logs. Did not work well for larger apps.
  • Slow queries were killed by a cron job that polled MongoDB's currentOp and maintained a limit per (API key, query template) combination
  • Backups: if important customers lost data due to human error, Facebook engineers would manually recover it from periodic backups
  • The object-level ACL system was highly inefficient. Numerous indexes were required, and together they could sometimes surpass the actual data size by a factor of 3–4.
  • As there was no mechanism for concurrency control (except for minimal support for things like counters), applications were often inconsistent
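
To see why a write concern of 1 combined with frequent failovers loses data, consider the toy model below. It is a sketch under simplifying assumptions, not MongoDB's replication protocol: `ToyReplicaSet` and its methods are hypothetical, replication only happens when a write asks for it, and a failover promotes the most up-to-date secondary and discards whatever the old primary had not replicated (the equivalent of throwing away the rollback files).

```python
class ToyReplicaSet:
    """Replica set reduced to one list of acknowledged writes per node."""

    def __init__(self, nodes=3):
        self.logs = [[] for _ in range(nodes)]
        self.primary = 0

    def write(self, doc, w=1):
        """Apply `doc` on the primary and on enough secondaries that
        `w` nodes hold it, then acknowledge the write."""
        self.logs[self.primary].append(doc)
        secondaries = [i for i in range(len(self.logs)) if i != self.primary]
        for node in secondaries[: w - 1]:
            self._catch_up(node)
        return True  # acknowledged to the client

    def _catch_up(self, node):
        # Copy over every entry the secondary is still missing.
        missing = self.logs[self.primary][len(self.logs[node]):]
        self.logs[node].extend(missing)

    def failover(self):
        """Primary fails: promote the most up-to-date secondary and
        return the writes that only the old primary had."""
        old = self.primary
        candidates = [i for i in range(len(self.logs)) if i != old]
        new = max(candidates, key=lambda i: len(self.logs[i]))
        lost = self.logs[old][len(self.logs[new]):]
        # The old primary rejoins with the new primary's history;
        # its extra writes are rolled back and discarded.
        self.logs[old] = list(self.logs[new])
        self.primary = new
        return lost
```

In this model, a write made with `w=1` followed by `failover()` comes back in the list of lost writes even though the client received an acknowledgement. The same write made with `w=2`, a majority of three nodes, survives the failover.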

What Parse should have done differently

Parse did a lot of things right. The documentation was great, the mobile SDKs were solid and the web UIs well-designed. However, they had an unspoken value system of not trusting their users to deal with complex database and architectural problems.

Coming from a database background, our view is that developers should know about details such as schemas and indexes (the Parse engineers strongly agreed in hindsight). Also, we think that backend services are not limited to mobile apps but are also very useful for the web.

I think that providers should be open about their infrastructure and trade-offs, which Parse was only after it had already failed.

If this idea sounds interesting to you, have a look at Baqend. It is a high-performance BaaS that focuses on web performance through transparent caching and scalability through auto-sharding and polyglot persistence.

We strongly believe that architecture should not be a secret.
