Migrating from DynamoDB to Postgres – Why and How

I’m not a database guy, I’m a node guy. I’ve been writing node for 4 years and like any other true node fanboy, if I need to use a database, I go straight to NoSQL. Earlier this year I was tasked with migrating my company’s data away from DynamoDB to something new of my choosing. This is the story of how I, against my natural instincts, chose a relational database.

You may be asking yourself, “if they’re big node guys, why weren’t they using Mongo?” That decision was made before my time, but the answer can be found here. Although Mongo has passed all Jepsen tests since version 3.4.x, the version that was current when our original database was chosen did not, and the team deemed it unreliable. This is also why you will not hear me mention it as one of the options I explored later in this post.

The Rise and Fall of DynamoDB

I’ll start with why we were leaving DynamoDB. For clarity’s sake, if I just call it Dynamo at any point, know that I am talking about the DynamoDB NoSQL database and not the old Amazon Dynamo key/value store that it is based on. Dynamo does some really nice things for you: it handles replication, and you simply pay for throughput, which can be really handy. It also scales great in terms of data storage. But if you think about scaling in the sense of your app growing and changing, this is where Dynamo really falls apart.

Our original Cloud API was written over 2 years ago and was significantly smaller and simpler than what we run today. As the company and product grew, so did the API. Now we had these tables that were defined years before and did not meet our current query requirements. If you know your way around Dynamo you know about the hash and range keys. These are poorly named keys that essentially work like this: the hash key determines where your data is stored and the range key determines how it’s sorted. This is where it gets confusing: hash keys are required and range keys are not. So if you’re just giving everything a unique primary key, that’s the hash key. But if you want to be able to query on something, for example all books by a certain author, you would instead make the hash key the author (not unique) and the range key the book id (unique). All that matters is that the two together form a unique ID, but it’s very confusing which key should hold which value in the two different cases in order to allow proper querying. Amazon also likes to use the terms partition and sort key instead of hash and range, which adds to the confusion.
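To make the books example concrete, here is a minimal sketch using the AWS SDK for Node. The table name, attributes, and throughput numbers are all made up for illustration; they are not from our actual schema.

```js
// Hypothetical "Books" table: author as the hash (partition) key,
// bookId as the range (sort) key. Together they must be unique.
const AWS = require('aws-sdk');
const dynamodb = new AWS.DynamoDB({ region: 'us-east-1' });

const params = {
  TableName: 'Books',
  AttributeDefinitions: [
    { AttributeName: 'author', AttributeType: 'S' },
    { AttributeName: 'bookId', AttributeType: 'S' }
  ],
  KeySchema: [
    { AttributeName: 'author', KeyType: 'HASH' },  // determines where the item is stored; not unique by itself
    { AttributeName: 'bookId', KeyType: 'RANGE' }  // determines sort order; makes the pair unique
  ],
  ProvisionedThroughput: { ReadCapacityUnits: 5, WriteCapacityUnits: 5 }
};

dynamodb.createTable(params, (err, data) => {
  if (err) return console.error(err);
  console.log('Created', data.TableDescription.TableName);
});

// With that key layout, "all books by a certain author" becomes a Query:
const doc = new AWS.DynamoDB.DocumentClient();
doc.query({
  TableName: 'Books',
  KeyConditionExpression: 'author = :a',
  ExpressionAttributeValues: { ':a': 'Ursula K. Le Guin' }
}, (err, res) => {
  if (!err) console.log(res.Items); // every book under that partition key
});
```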

If you know how your data needs to be queried when you create your table, the hash/range key combination could be the answer to your query problems. However, we did not. When our tables were created we only needed to get data by its primary key, which was simple and fast. As the API grew we needed to get groups of data based on many other attributes, which left us with only two options: scans or global secondary indexes (GSIs).

Scans read all of the data in the table. They are incredibly slow and inefficient, and they are also one of the only ways to get the data you need if your tables aren’t defined exactly how you need them from the start. As you grow, scans become slower and slower, digging you deeper into your “low latency” database hole. GSIs are essentially just another table containing projections of your data from the original table, but with different hash/range keys, allowing you to define a new key to query on after the fact. This saves you from the devastating latency hits of scans, but you are now paying double: once for the data ending up in the index and again for the throughput to put it there. Also, the data in the index is not guaranteed to be the latest, since applications write to the original tables and Dynamo uses an eventual consistency model for projecting data into the indexes. The two approaches are sketched below.
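Here is roughly what the two options look like side by side with the DocumentClient, continuing the hypothetical Books table; the GSI name and attributes are invented for the example.

```js
const AWS = require('aws-sdk');
const doc = new AWS.DynamoDB.DocumentClient({ region: 'us-east-1' });

// Option 1: Scan. Reads every item in the table and filters afterwards,
// so cost and latency grow with the size of the table.
doc.scan({
  TableName: 'Books',
  FilterExpression: 'genre = :g',
  ExpressionAttributeValues: { ':g': 'sci-fi' }
}, (err, res) => {
  if (!err) console.log(res.Items); // may be paginated via res.LastEvaluatedKey
});

// Option 2: Query a GSI (here a made-up 'genre-index' whose hash key is genre).
// Only items under that key are read, but the index is a separately billed,
// eventually consistent copy of the projected data.
doc.query({
  TableName: 'Books',
  IndexName: 'genre-index',
  KeyConditionExpression: 'genre = :g',
  ExpressionAttributeValues: { ':g': 'sci-fi' }
}, (err, res) => {
  if (!err) console.log(res.Items);
});
```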

Needless to say, unless you’re only getting data by its primary key, or have very few queries to run on the data itself (and you know these from the start!), Dynamo does not allow your application to grow. Not to mention there is no way to have the data returned sorted by anything other than the originally defined range (aka sort) key.

Finding a New DB

When it came to the different types of NoSQL databases, it was pretty clear that a document store best suited our needs. We needed to be able to query based on the data itself, so key/value was out; there was no need for the complex connections between data that a graph db offers; and nobody actually knows what a column db is (jk, but our data is not the right use case for it). So document it was. We would have to change our data layout a bit to really work well with most document databases, since our current structure was similar to what you would expect for a relational db with tables and so on, given that Dynamo also uses a table structure. I wanted something that was ACID, none of that eventual consistency nonsense. That took a lot of them out of the race right there.

I came across CouchDB, an open source document store by Apache. It’s ACID compliant. Score. All data is stored as JSON. Sick. High availability and reliability. Sweet. HTTP interactions only. Indexes called “views” can be defined at any time using a MapReduce JavaScript function that stores data in B-trees for fast and easy retrieval. I was pretty much sold. This was my DB. There was one last thing to check: replication. As I read about CouchDB’s replication it seemed to be exactly what I needed (easy master-master clustering? I’m in.) until I reached the section on sharding. CouchDB only supports pre-sharding (insert sad trombone noise here). You can easily add nodes and move shards between nodes as they grow, but at some point you’re going to need more shards, at which point you have to create a whole new cluster and do a complete database migration. This just wasn’t going to fly. I had to keep looking.
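For anyone who hasn’t seen a CouchDB view before, this is the general shape of one. The design document name, database name, and doc fields here are illustrative, not anything from our data.

```js
// A design document holding a "view": the map function lives in the database
// and its emitted key/value pairs are indexed into a B-tree.
const designDoc = {
  _id: '_design/books',
  views: {
    by_author: {
      // Stored as source text; CouchDB runs it over every document.
      map: function (doc) {
        if (doc.type === 'book') {
          emit(doc.author, { title: doc.title });
        }
      }.toString()
    }
  }
};

// Everything is HTTP: PUT the design doc to http://localhost:5984/mydb/_design/books,
// then query it with
//   GET /mydb/_design/books/_view/by_author?key="Ursula K. Le Guin"
```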

Switching to Relational

After some database soul searching it seemed like my list of requirements was doomed. Until I realized the answer had been in front of me the entire time. It was how databases have worked since the 70s. Relationally. My need to always be using the newest flashy thing had clouded my view of what the real solution was. I wanted ACID compliance, reliability, fast queries, table schemas, literally all of the major features of a relational database! I quickly began researching, and after throwing out a large chunk of possibilities due to their proprietary licenses (get real, Oracle and Microsoft) and finding the most reliable option, I had my answer: PostgreSQL. It’s open source, SQL standard compliant, and incredibly reliable. It even has indexable JSON data types, allowing me to still nest my JSON data where necessary and even query on it! The major issue here once again ended up being replication. However, in this case it was not a deal breaker. Replication with PostgreSQL is complex and there are a lot of options, but many of these options work for what we need (and you can always use RDS).
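As a quick illustration of the indexable JSON point, here is a small sketch run through node-postgres. The table, column names, and data are hypothetical; the JSONB type, GIN index, and the @> containment operator are standard Postgres features.

```js
const { Client } = require('pg');

async function demo() {
  const client = new Client({ connectionString: process.env.DATABASE_URL });
  await client.connect();

  // A normal relational row that still carries nested JSON in a JSONB column.
  await client.query(`
    CREATE TABLE IF NOT EXISTS devices (
      id         serial PRIMARY KEY,
      account_id integer NOT NULL,
      metadata   jsonb   NOT NULL DEFAULT '{}'
    )`);

  // A GIN index lets queries into the JSON use the index instead of scanning.
  await client.query(
    'CREATE INDEX IF NOT EXISTS devices_metadata_idx ON devices USING gin (metadata)');

  // @> asks "does the JSON contain this sub-document?" and can use the GIN index.
  const { rows } = await client.query(
    `SELECT id FROM devices WHERE metadata @> '{"firmware": "2.1.0"}'`);
  console.log(rows);

  await client.end();
}

demo().catch(console.error);
```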

Migrating was fairly simple. Thanks to Dynamo’s table structure, our overall data layout did not change. I wrote a script that took advantage of both the Vogels DynamoDB node module and the node-postgres module (a rough sketch of its shape is below); they are both very clean and easy to use. I had to rename some columns that were using Postgres keywords as names, and I had to add some many-to-many tables for normalization, but overall the change was smoother than it would have been with any of the NoSQL databases I looked at. The change drastically increased the speed of our API: some requests are up to 6x faster than they were with Dynamo. The account details request, our previously slowest call, took around 12 seconds with Dynamo and was reduced to 2 seconds with Postgres! Although I think Dynamo is a good database for a specific kind of problem, I do not recommend it for a growing and changing application that could need new indexes and queries for its ever expanding features at any time; this is where the flexibility and speed of a relational database really shone through for us.
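For the curious, the migration script was roughly this shape. This sketch substitutes the plain AWS DocumentClient for the Vogels model on the read side, and every table and column name in it is hypothetical.

```js
const AWS = require('aws-sdk');
const { Client } = require('pg');

const doc = new AWS.DynamoDB.DocumentClient({ region: 'us-east-1' });
const pg = new Client({ connectionString: process.env.DATABASE_URL });

async function migrate() {
  await pg.connect();
  let lastKey;

  do {
    // Page through the Dynamo table; each Scan call returns at most ~1 MB of items.
    const params = { TableName: 'accounts' };
    if (lastKey) params.ExclusiveStartKey = lastKey;
    const page = await doc.scan(params).promise();

    for (const item of page.Items) {
      // Illustrative rename: "user" is a Postgres reserved word, so the
      // column becomes owner_name on the way in.
      await pg.query(
        'INSERT INTO accounts (id, owner_name, settings) VALUES ($1, $2, $3)',
        [item.id, item.user, item.settings]
      );
    }

    lastKey = page.LastEvaluatedKey;
  } while (lastKey);

  await pg.end();
}

migrate().catch(console.error);
```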

