Riot Games Messaging Service

Now that we’ve got the basic architecture down, let's focus for a second on how we publish messages and how they’re delivered to the connected clients.

Messages intended for RMS follow the consistent format seen below. These simple JSON blobs are published to the RMSR REST interface:

{
   "resource": "clubs/v1/clubs/665632A9-EF44-41CB-BF03-01F2BA533FE7",
   "service": "https://clubs.na.lol.riotgames.com",
   "version": "2016-10-12 09:33:43.1245",
   "recipients": ["3029B94E-412F-484C-B4E7-BD01073EA629", "81CA1B98-F906-4DDD-86CC-A3FF82A16505"]
}

Each message class is identifiable using the service and resource fields, which both follow REST resource modeling principles.

RMS can also deliver short payloads associated with messages. This is typically done if the updated state is relatively small, or to encode a REST method to use when contacting the publisher service.

{
   "resource": "clubs/v1/clubs/665632A9-EF44-41CB-BF03-01F2BA533FE7",
   "service": "https://clubs.na.lol.riotgames.com",
   "version": "2016-10-12 09:33:43.1245",
   "recipients": ["3029B94E-412F-484C-B4E7-BD01073EA629"],
   "payload": "{\"method\": \"GET\"}"
}
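
To make the publish side concrete, here is a rough Python sketch of what a publisher service might do. The RMSR hostname, the /publish path, and the helper name are assumptions made purely for illustration; the actual publish endpoint isn't documented in this post.

import json
from datetime import datetime, timezone

import requests  # third-party HTTP client: pip install requests

# Hypothetical internal RMSR publish endpoint, used here for illustration only.
RMSR_PUBLISH_URL = "http://rmsr.internal.example.com/publish"

def publish_club_update(club_id, recipient_ids, payload=None):
    """Tell RMS that a club resource changed, so it can notify the given recipients."""
    message = {
        "resource": f"clubs/v1/clubs/{club_id}",
        "service": "https://clubs.na.lol.riotgames.com",
        "version": datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M:%S.%f")[:-2],
        "recipients": recipient_ids,
    }
    if payload is not None:
        # Payloads travel as opaque strings, so the publisher JSON-encodes them itself.
        message["payload"] = json.dumps(payload)
    response = requests.post(RMSR_PUBLISH_URL, json=message, timeout=2)
    response.raise_for_status()

# Example: ask a single player to re-fetch the club with a GET.
publish_club_update(
    "665632A9-EF44-41CB-BF03-01F2BA533FE7",
    ["3029B94E-412F-484C-B4E7-BD01073EA629"],
    payload={"method": "GET"},
)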

Once a published message reaches one of the RMSR servers, it's parsed and validated. The recipients list is resolved against the global RMSR session table into the set of RMSE servers holding at least one session for any of the recipients, and is then stripped from the message. The modified message is forwarded to each of those RMSE nodes.

Upon arriving at an RMSE node, the message is handed to each recipient's session handler, serialized, and pushed to the client over the WebSocket connection.
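
RMS itself is written in Erlang (more on that below), but the routing and delivery path reads roughly like the Python sketch that follows. The session-table layout, the idea that RMSR attaches a per-node recipient list, and all of the names are assumptions for illustration, not the actual implementation.

import json
import time
from collections import defaultdict

# Stand-in for the global RMSR session table: recipient id -> RMSE nodes holding a session for them.
SESSION_TABLE = {
    "3029B94E-412F-484C-B4E7-BD01073EA629": {"rmse-1"},
    "81CA1B98-F906-4DDD-86CC-A3FF82A16505": {"rmse-1", "rmse-2"},
}

def route_on_rmsr(message):
    """Resolve recipients against the session table and build one stripped message per RMSE node."""
    recipients = message.pop("recipients")              # strip recipients from the forwarded message
    per_node = defaultdict(set)
    for recipient in recipients:
        for node in SESSION_TABLE.get(recipient, ()):   # recipients with no live session are skipped
            per_node[node].add(recipient)
    # Assumption: each RMSE node is told which of its local recipients the message is for.
    return {node: {**message, "local_recipients": sorted(ids)} for node, ids in per_node.items()}

def deliver_on_rmse(node_message, local_sessions):
    """On an RMSE node: serialize the message and push it to each recipient's open WebSocket(s)."""
    recipients = node_message.pop("local_recipients")
    node_message["timestamp"] = int(time.time() * 1000)  # delivery timestamp, in milliseconds
    wire_format = json.dumps(node_message)
    for recipient in recipients:
        for connection in local_sessions.get(recipient, ()):
            connection.send(wire_format)                 # assumed send() on the session's connection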

In the end, this is what the client receives:

{
   "resource": "clubs/v1/clubs/665632A9-EF44-41CB-BF03-01F2BA533FE7",
   "service": "https://clubs.na.lol.riotgames.com",
   "version": "2015-10-12 09:33:43.1245",
   "timestamp": 1444677045952,
   "payload": "{\"method\": \"GET\"}"
}

Internally, clients have a simple subscription table that allows their plugins to be notified of local resource state changes (see the previous article on LCU Architecture for more details). Whenever a new RMS message arrives, the client updates the local resource identified by the message's resource field, and all interested local consumers are notified of the update. Some plugins then re-fetch their state from the publisher by concatenating the service and resource fields, while others only consume the incoming event.
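
The client itself is C++, but the subscription flow is simple enough to sketch in Python; the class, the method names, and the demo plugin below are made up for illustration.

import json
from collections import defaultdict

class ResourceSubscriptions:
    """Minimal stand-in for the client's subscription table."""

    def __init__(self):
        self._callbacks = defaultdict(list)   # resource prefix -> interested plugin callbacks

    def subscribe(self, resource_prefix, callback):
        self._callbacks[resource_prefix].append(callback)

    def on_rms_message(self, raw_message):
        """Called for every message arriving over the RMS WebSocket connection."""
        message = json.loads(raw_message)
        resource = message["resource"]
        for prefix, callbacks in self._callbacks.items():
            if resource.startswith(prefix):
                for callback in callbacks:
                    callback(message)

# A hypothetical plugin that re-fetches its state from the publisher when notified.
def on_club_update(message):
    url = message["service"].rstrip("/") + "/" + message["resource"]        # service + resource
    method = json.loads(message.get("payload", "{}")).get("method", "GET")  # encoded REST method
    print(f"would issue {method} {url} to refresh local club state")

subscriptions = ResourceSubscriptions()
subscriptions.subscribe("clubs/v1/clubs/", on_club_update)
subscriptions.on_rms_message(json.dumps({
    "resource": "clubs/v1/clubs/665632A9-EF44-41CB-BF03-01F2BA533FE7",
    "service": "https://clubs.na.lol.riotgames.com",
    "payload": "{\"method\": \"GET\"}",
}))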

This service approach to messaging is extremely flexible and easy for teams across Riot to integrate with. Supporting a new type of event requires a single REST request on the publish side and two lines of C++ in the client to receive and process the message. This enables immediate updates for players, faster development for Riot teams, and a standard, efficient, and elegant way to deliver async messages from servers to clients. Win-win-win!

Road to Production

We implemented RMS entirely in Erlang, which helped us enforce proper design patterns, simplified concurrency and distribution, and let us re-use many of the existing internal libraries we built for other services like the Chat Service.

Once built, RMS releases are packaged into Docker containers, giving us service isolation and relocatability, simplifying the deployments and configuration, and making us fairly independent from the underlying OS. For other examples of Docker usage at Riot, check out Maxfield Stewart’s post here.

The service is deployed to AWS, giving us enormous flexibility when it comes to scale, failure testing, and in-development experimentation. All AWS resources - soup to nuts - are managed by Terraform, an infrastructure automation tool that allows us to express our entire infrastructure as code. We can edit it programmatically, avoid costly human misclicks, version it, and much more. As an added bonus, we can spin up and tear down new RMS environments within a few minutes without having to manually configure each one. That allows us to create ephemeral environments just for testing, then destroy them when we no longer need them.

RMS itself has been designed as a global service - although technically it can follow League of Legends’ sharding model, it is game-agnostic, and can handle multiple shards at once. It's also a fundamental messaging layer that's meant to be used by almost all backend player-facing services, so it has to be robust. We went to great lengths to make sure RMS is reliable, fast, scalable, and deployable with 0 downtime (more on that later).

Since the service’s CCU (Concurrently Connected Users) grows with the number of active clients, we wanted to establish the capacity of a single RMSE and RMSR node, and then come up with a formula that would tell us when to scale the cluster.
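
The exact sizing rule isn't spelled out in this post, but its shape is simple. Here is a hedged sketch, where the per-node capacity and headroom factor are placeholders rather than our real figures:

import math

def rmse_nodes_needed(peak_ccu,
                      connections_per_node=1_000_000,   # placeholder per-node capacity from load testing
                      headroom=0.5):                    # keep each node at half its tested capacity
    """Rough cluster-sizing rule: enough nodes that no node exceeds its headroom-adjusted capacity."""
    return max(1, math.ceil(peak_ccu / (connections_per_node * headroom)))

print(rmse_nodes_needed(peak_ccu=7_500_000))   # -> 15 nodes with these placeholder numbers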

Load Testing

Unfortunately, load testing by running millions of actual League clients is not a feasible option here (we would need thousands of servers to generate that much load), so we came up with something better: a dedicated RMS Load Test Tool.

The Load Test Tool is responsible for simulating real-world RMS scenarios. In simple terms, it hammers RMSE with tens of thousands of WebSocket connections, publishes tons of RMS messages addressed to those clients, then waits for the messages to arrive.

After running the full loop, it compares the timestamps encoded in the message payloads and produces a number of useful statistics, such as publish-to-delivery latency histograms, publish response times, and authentication durations.

The Load Test Tool is also linearly scalable: each instance can hold ~60k active RMS connections (limited by the ~2^16 - 1 TCP ports available, less those reserved for system services), and we can grow the simulated CCU by adding more Load Test Tool boxes.
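
The tool itself isn't public, but its core loop is easy to picture. Below is a minimal sketch using the third-party websockets library, with the endpoint URL and the payload's sent_at field assumed for illustration rather than taken from the real tool.

import asyncio
import json
import time

import websockets  # third-party: pip install websockets

RMSE_URL = "wss://rmse.example.com/rms/v1"   # hypothetical endpoint, for illustration only

async def simulated_client(latencies):
    """Hold one WebSocket connection and record publish-to-delivery latency for every message."""
    async with websockets.connect(RMSE_URL) as connection:
        async for raw in connection:
            message = json.loads(raw)
            sent_at = json.loads(message["payload"])["sent_at"]   # timestamp encoded by the publisher
            latencies.append(time.time() * 1000 - sent_at)        # end-to-end latency in milliseconds

async def run_load(num_clients=1000):
    latencies = []
    await asyncio.gather(*(simulated_client(latencies) for _ in range(num_clients)))
    return latencies

# asyncio.run(run_load())   # the real tool runs ~60k of these per box and aggregates the histograms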

After a couple of rounds of write code -> deploy -> load test -> profile, and after fixing a few bottlenecks in the system, we established that the single largest RMSE machine (r3.8xlarge instance type) can handle at least 10M concurrent connections. This number - and the fact that we can easily add servers to increase RMSE capacity - greatly exceeded our expectations. We were thrilled to tick the load-test box on our checklist.

Fun fact: in order to get to 10M connections, we had to spin up 170 Load Test Tool servers. During testing, 170 terminals could not be squeezed onto my MacBook's screen using csshX, so I had to split them into 5 groups of terminals.

During the tests, a cluster of Load Test Tools was logging in almost 2000 clients per second, publishing and receiving over 20,000 messages per second, and consistently hitting ~15ms for end-to-end publish-to-receive scenarios.

Having a baseline for a single (albeit pretty beefy) server allowed us to define autoscaling policies that would kick in whenever the service is running out of capacity or wasting resources. At the end of the day we decided to run RMSE on much smaller instances to avoid putting all our eggs in one basket (losing 10M connections in case of an instance failure is not fun).

Fault Tolerance

Many distributed systems fall prey to the fallacies of distributed computing, the first of which is assuming the network is reliable. To address network reliability, we took Jepsen, a framework for distributed systems verification with fault injection, and implemented a plugin for RMS. Jepsen was responsible for simulating random network failures: splitting RMSR clusters into random halves, introducing communication issues between the RMSE and RMSR tiers, and messing with the cluster topology in ways we couldn't even imagine. This approach helped us catch a series of nasty bugs that would have caused big headaches had they hit production servers.

0 Downtime Deployments

Since a single RMS cluster handles multiple League of Legends shards, there is practically never an ideal time to take the service down without serious impact to players. In order to deploy new features and bug fixes in the smoothest way possible, we needed the ability to release new versions of RMS to production servers without interrupting existing traffic.

Let's look at RMSE first. Since each RMSE node holds a persistent WebSocket connection to its players, we couldn't simply follow the blue-green deployment paradigm and shut nodes down whenever we pleased. To overcome this difficulty, we decided to bolster RMS with something called session durability. Whenever RMSE detects an unexpected disconnect of a client connection, or whenever RMSR detects the loss of an RMSE node, the affected sessions are put into a 'durable' mode: messages addressed to the affected players are cached in memory and delivered to the client on a successful reconnect to the system. If the client doesn't reconnect within the next few minutes, or the system is close to running out of memory, durable sessions and their queued messages are purged.

A client that reconnects successfully won't even know it missed any messages, though its processing might be slightly delayed due to the reconnection itself. Failure to reconnect on time will result in the client re-syncing its state with all services since it might have missed important state updates while the RMS connection was down.
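
Conceptually, session durability is a bounded, time-limited message buffer per disconnected session. The real implementation is part of the Erlang service; the Python sketch below uses assumed limits (the TTL and per-session cap are illustrative, not our production values).

import time
from collections import deque

DURABLE_TTL_SECONDS = 300          # assumed "next few minutes" reconnect window
MAX_BUFFERED_MESSAGES = 1000       # assumed per-session cap to bound memory use

class DurableSession:
    """Buffers messages for a disconnected player until they reconnect or the session expires."""

    def __init__(self):
        self.disconnected_at = time.monotonic()
        self.pending = deque(maxlen=MAX_BUFFERED_MESSAGES)   # oldest messages dropped if the cap is hit

    def buffer(self, message):
        self.pending.append(message)

    def expired(self):
        return time.monotonic() - self.disconnected_at > DURABLE_TTL_SECONDS

    def flush_on_reconnect(self, connection):
        """Replay everything the client missed, in order, over its new WebSocket connection."""
        while self.pending:
            connection.send(self.pending.popleft())

# A periodic sweep (plus memory-pressure checks, omitted here) purges sessions that never came back.
def purge_expired(durable_sessions):
    for session_id in [sid for sid, session in durable_sessions.items() if session.expired()]:
        del durable_sessions[session_id]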

This technique allowed us to terminate entire RMSE nodes without draining their connections, making the RMSE tier rollout fast and easy.

For RMSR we’re able to spin up a secondary cluster, sync its state with the one that's currently running and handling production load, and route all traffic to the new boxes with one command.

Both RMSE and RMSR tiers can be deployed independently from one another with no player-facing impact, allowing us to introduce new features and fix bugs without costly maintenance windows.

Production Mode

Finally, with the open beta for the League client update, RMS has been put to the test. Real players started connecting to the service, services started publishing genuine messages, and the client began consuming them.

So far (knock on wood!) the service is performing as predicted: we haven't had a single failure or dropped message, we’ve been able to deploy to production clusters on multiple occasions, and our metrics reports look similar to our load tests (diligence: #worth).

Moving forward, we would like more services to leverage RMS for player messaging and, once the AIR client is deprecated, to have all players around the world connect to RMS!

I hope you enjoyed this post. If you have any questions or comments, please share them below. See you next time!

