21 September 2016
TL;DR (especially for Bud Goode)
The .NET garbage collector is a complex beast. Occasionally it might infuriate but remember that it's keeping you from the misery of manual memory management and that you're better to consider it an ally than a foe.
Sometimes, ways to improve its performance seem counter-intuitive, such as intentionally keeping objects around that will have to be considered by each of the already-expensive gen 2 collections, even when we have no intention of letting those objects go (aka. object pooling) and such as using disk caching instead of in-memory caching, despite an in-memory cache "obviously" being more performant than having to hit the file system.
The deep dive
At work we have a service that handles queries from our hundreds of tourism websites and talks to the backend databases when, say, someone searches for Concerts in a particular location or wants to book a Hotel on a particular date. It caches many of the results of these queries for ten or fifteen minutes, which takes a lot of load away from the database servers and greatly reduces the average response times for users of the web sites. It handles a few million requests a day - so it's hardly Google but it's also doing enough work to be interesting from a performance point of view.
The load has been spread over a pair of servers for a few years now, initially for redundancy. However, there was a point at which it became clear that a single server could no longer handle the load. When there were too many concurrent requests, individual requests would take longer to be processed, which resulted in the number of concurrent requests going up and up and the individual request times following suit until requests started timing out. Over time, two servers became four servers and there is concern now that two servers could not reliably handle all of the load.
On top of this, the memory usage of the service on each of the servers appears to slowly-but-surely increase over time until it gets high enough and, for want of a more technical term, freaks out. The thread count in the service builds and builds as more request are backing up, waiting to be processed. The requests times get longer and longer. Using PerfMon, it looks like several CPU cores are tied up entirely on garbage collection (since we're using the "server mode" GC configuration, there is a separate managed heap - and a separate collection thread - for each processor). Strangely, at this point, the cores don't appear to be max'ing out, the average CPU usage for the server is reliatively low, though the "% time in GC" is high. Every few weeks, it seems like one of the servers would need the service restarting on it due to a "memory blow up".
The time finally came that we could no longer continue to brush this under the rug - this problem was not going to go away and the occasional "uh-oh, someone is going to have to restart the service again" was no longer acceptable; not only was there a minute or two downtime for the actual service restart, there was also half an hour or so leading up to it during which response times were getting unacceptably long.
Blaming the GC
It would seem all too easy to blame things on the garbage collector, say that the fault lies there and that there's nothing we can do about it other than avoiding giving it more load than it can deal with (in other words, throw more servers at the problem). But I recently read a tweet that said
Blaming perf issues on Garbage Collection is like blaming your hangover on your liver... Its the thing that's saving you from your code
(Courtesy of Ben Adams).
.. which helped motivate me into trying to find a better solution. On the whole, I'm very grateful that the .NET garbage collector works as well as it does. In the majority of cases, you just don't have to worry about it. But sometimes you do. Clearly, the data service that I'm talking about is one of those cases.
The garbage collector uses a range of factors to decide when to collect - some are simple, such as the amount of free memory available in the system (if there's a lot then there's less pressure to collect) and the available processor time (if the system is busy doing other work then it would ideal to wait until it's quieter before throwing GC work on top of the "real work"). Some factors are more complex - for example, if gen 2 collections occur that release zero (or very little) memory then the GC will take this into account and try to collect it less often (since gen 2 collections are the most expensive, it makes sense for the GC to avoid them if possible; if it finds that few references are being released from gen 2 each collections then there's little point doing the collection work).
However, there is a limit to how much the GC can deal with things with by magic. Sometimes you need to work with the garbage collector, rather than just presuming that it will be able to deal with anything you throw at it.
One way to help is to simply make fewer allocations. The less allocations that are made, the less work that there is for the garbage collector to do.
Another approach is take on board one of Ben Watson's key principles for enabling "high-performance garbage collection on servers", which is:
Objects Live Briefly or Forever
In order to think about how I can make the GC's life easier, it's worth taking a step back and describing a little bit more about what this troublesome service has to do.
What the service deals with (and why this might not be GC-friendly)
At its core, this service receives requests from websites, gets the required data for that request from an external source or from cache (if from an external source, such as a database, then the results will be cached for subsequent requests), massages the data into the desired form and sends it back to the client. For the majority of the time, the results include only "stubs" of the real data - a unique identifier, the name and its location (latitude, longitude). When the results of a query are cached, such as "get me all of the hotels in Liverpool that are in the city centre and have at least a three star rating, ordered by rating (descending)", the query and the resulting stubs are cached. The response to the client will include all of those stubs but it will also include full product details for one page of data - if the client says that it wants to show the first page of results to the user and each page shows ten items, then the first ten entries in the result set will be full products and the remaining "n - 10" entries will be stubs. There is a separate service that is responsible for retrieving full product details for given stubs.
The quickest way that the service may deliver these results is if the query results are stored in cache. In which case, the requests will be dealt with using the following steps:
- The request is deserialised
- Ordered stubs corresponding to the query are retrieved from the "Query Cache"
- Full product details are retrieved for the first page (as specified by the request) of results - this involves serialising a request for ten unique identifiers, sending it to the external product service and then receiving the details for those products back (which means that there's a step that involves deserialisation of full product records when the data comes over from the external service)
- The response (which consists of 10 full products and "n - 10" stubs) is serialised to be sent over the wire back to the client
The core of the service was written in (and has been in use since) 2009, which has a spread of advantages and disadvantages. On the less-good side, it uses .NET remoting (at the time, our servers only had .NET 2.0 installed and so newer technologies such as WCF were out of reach) and much of the serialisation uses the BinaryFormatter (which is unlikely to be anyone's go-to these days if they are interested in performance). On the other hand, over the years it's been proven to be largely reliable and it's been through a few optimisiation drives since the business is so reliant on it. So the places where serialisation performance is most important have had the BinaryFormatter replaced; anywhere that the stubs are serialised/deserialised, for example, uses custom methods to read/write the fixed fields in the stub type. Similarly, the "full product" records are serialised using custom routines (which is a double win when responding to a request since the full product instances must be deserialised when they are received from the external product service and then re-serialised to be included in the final response to the client, so that's twice that use of the slow BinaryFormatter is avoided).
What I'm trying to say here is that any "low hanging fruit" in terms of performance hotspots within the service code had been improved in the past. It genuinely did seem like it was the garbage collector that was responsible for much of the performance problem. (I did use the ANTS Performance Profiler on a local installation of the service under load to confirm this but it didn't reveal anything exciting). So it was firmly in the direction of the garbage collector that I faced.
I've written much of this service's code, so I'm familiar with its general structure as well as many of the finer details. With this knowledge, I captured a batch* of sample requests and set up a test environment that I could replay these requests against (using SqlProxyAndReplay to remove the database from the equation).
* (The test queries were taken from real web site logs and replayed at the same rate and level of concurrency - so they should be a reasonable approximation of real life load)
The plan being to try to tweak the code that was likely to offend the GC the most and measure after each change to see how it affected the work that the collector had to do. The first candidates were:
- The "Query Cache" needs to retrieve from, add to and remove from a structure that will be accessed concurrently be multiple threads. The very first implementation was a dictionary that required a lock for every read or write access. This was changed so that a lock was only required for write actions, which would clone the dictionary and overwrite the internal reference. Read actions wouldn't require a lock since no dictionary would ever change. However, this clone-for-every-write could mean a lot of churn.
- The custom serialisation uses binary data reader and writer classes. Each individual property value is serialised into a byte array (the BitConverter is used for many types of values and the UTF8Encoder is used for strings) and then these bytes are added to a List<byte> (and the reverse is done to deserialise; sections of the list are extracted into arrays and then translated back into property values). This means that there are a lot of arrays being allocated when serialising or deserialising.
- When serialising/deserialising the full product records, it seems very possible that these records could be over 85,000 bytes of serialised data, which would mean that there would be lots of byte arrays created on the Large Object Heap (where "lots" depends upon how many requests a second are being handled, how many full product records need to be retrieved for the requests and how many of those records were more than 85,000 bytes when serialised). Allocations to the Large Object Heap can be a source of headaches, which I'll go into in a little more detail later on.
Step 1: Bin the custom "free-reading dictionary"
There's a ConcurrentDictionary in .NET these days, which should improve things when compared to the custom structure we were using. Using it means that read and write actions both lock again but the locks are much more granular (they only affect subsets of the data, rather than there being a single huge lock around the entire reference) and so there is less likely to be contention between operations.
The batch of test queries were run against this change and the garbage collection frequency performance counters were captured. A few runs were performed and the results averaged and.. er.. annoyingly, I've lost my notes relating to this change! There were less collections required for each generation, which was promising. Thankfully I do have some useful information for the next changes :)
Step 2: Bin the custom binary reader and writer classes
The .NET library has BinaryReader and BinaryWriter classes that take a stream and read/write to it in a more efficient manner than the custom reader/writer classes used before (which allocated at least one array for every single property read or write). These aren't new classes, I just wasn't aware of them when I wrote the custom versions all that time ago.
The tests were repeated with this change and, compared to only the Query Cache change, there were on average 56% as many gen 0 collections, 60% as many gen 1 collections and 59% as many gen 2 collections.
Step 3: Pooling large byte arrays used in serialisation/deserialisation
Time to talk about the Large Object Heap. The garbage collector is much happier dealing with "small objects" (which are decided to be those less than 85000 bytes, based upon "a bunch of benchmarks" according to this excellent Stack Overflow answer by Hans Passant). With small objects, it will allocate them freely and then, after collections, compact the heap for objects that survive the collection. If the heaps are not compacted, then any gaps in between "live" objects (live objects are those that the GC finds to still be in use) could only be used to slot in newly allocated objects if they fit in the gaps. As objects are allocated and then tidied up, it can become more and more difficult to find somewhere to fit new allocations - it might be necessary to look at many small gaps before finding one that a new object will fit in (this problem is referred to as being caused by fragmentation of the heap). Compacting the heap moves all of the objects so that they're pushed up together, with no gaps, and is relatively cheap when dealing with small objects since each individual memory operation is cheap. However, copying big chunks of memory around (such as the live objects in the Large Object Heap), which is what would be required to compact the Large Object Heap, is much harder work. Following the same sort of logic (that large objects are more expensive to deal with), the Large Object Heap is only collected during a gen 2 collection.
If a lot of allocations are made to the Large Object Heap then memory can appear to spiral out of control (because the Large Object Heap is only collected in gen 2 and because it's not compacted) and the pressure on the GC will increase. Unfortunately, this can be done quite easily when frequently serialising/deserialising to arrays that break the 85,000 byte limit.
One solution is to "pool" those byte arrays. In other words, to maintain a set of arrays and to reuse them, rather than creating new ones each time (which the GC will have to work hard to tidy up after). It's not difficult to imagine that this could easily become a very complicated task - whatever is responsible for pooling those arrays would need be thread safe and it would have to apply some sensible logic to when and how to reuse arrays; Should all arrays be reused? Should only large arrays be reused? Should all large arrays be reused? Will there be any limits to the pool? What if the limits are exceeded and more arrays are required?
Interestingly, I read last year about something that might be ideal for the job in the book Writing High-Performance .NET Code (written by Ben Watson, who I quoted earlier - it's a book I highly recommend, btw). I'm going to lift the overview completely from the blog post Announcing Microsoft.IO.RecycableMemoryStream (which is a quote lifted from the book) -
In one application that suffered from too many LOH allocations, we discovered that if we pooled a single type of object, we could eliminate 99% of all problems with the LOH. This was MemoryStream, which we used for serialization and transmitting bits over the network. The actual implementation is more complex than just keeping a queue of MemoryStream objects because of the need to avoid fragmentation, but conceptually, that is exactly what it is. Every time a MemoryStream object was disposed, it was put back in the pool for reuse.
That sounds like a very similar use case to what I have. Lots of serialisation/deserialisation for transmitting and receiving data from other servers, with byte arrays large enough to be allocated on the Large Object Heap. All of them being wrapped in MemoryStreams (at least, MemoryStreams were used for serialisation of these large objects after Step 2, above, was implemented).
So this definitely seemed worth looking into.
Just to recap precisely why pooling large objects might help; pooling them means keeping hold of them in memory, which seems like the opposite of what we want to do if we want to relieve memory pressure. However, the big benefit is that the Large Object Heap fragmentation will be less of a problem because we're no longer allocating large objects and then throwing them away and then trying to allocate further large objects somewhere (such as in a gap that the GC has removed dead objects from or possibly resorting to tacking them on the end of the heap); instead, a MemoryStream (and its large backing array) may be reused after it's been created once and returned to the pool, so the work to try to find a place to allocate a new large object is not required. This still feels somewhat counterintuitive because it means that there will be more objects that the garbage collector has to consider when it does a gen 2 collection and we're trying to give the GC as little work as possible - particularly in gen 2, since collections there are most expensive. This is where the GC's self-tuning comes in, though. If we're trying to get towards a position where not many objects make it into gen 2 unless then are going to live forever (as pooled objects do) then the GC will be in a position where it has to evaluate the gen 2 heap but - ideally - find very little to remove. If it consistently finds very little to do then it will reduce the frequency of the gen 2 collections. So, even though it might feel like we're making the collectors life more difficult by keeping these objects alive on the gen 2 heap, we're actually making it easier.
With this change, after running the tests again, there were 61% as many gen 0 collections as after only Step 2, 53% as many gen 1 collections and 45% as many gen 2 collections. This means that Step 2 and Step 3 combined resulted in 34% as many gen 0 collections than after only the changes in Step 1, 32% as many gen 1 collections and 27% as many gen 2. This seemed very promising.
Testing under increased load
The sample data that I'd been using so far wasn't particularly large, it was around 10k requests that would complete in around ten minutes. This is comparable to the request/sec that the production servers deal with during the day. While each run took place, after the changes made above, the CPU usage averaged around 40% and the "% time in GC" averaged 2.5%. I had a feeling, though, that it would be while the server was having a really hard time that the original issues would occur. 40% average CPU usage is nowhere near running flat out and that remaining 60% provides a lot of head room for the garbage collector to come and do what it wants whenever it wants.
So I increased the load and duration. Not massively, but enough that the previous code started to get a little hot under the collar - 100k requests over an hour or so.
This sample load was run against both the new and the old versions of the service (where the old version was the code as it was before Steps 1, 2 and 3 from above were applied to it) and the performance metrics compared between the two. On average, the new version required only 84% as much CPU to be used, spent only 30% as much time in the GC, performed 62% as many gen 0 collections, 36% as many gen 1 collections and 22% as many gen 2 collection. Things were still looking promising.
Testing for the breaking point
At this point, it was feeling like a success.
To stretch things further, though, I thought that I'd see how it responded if I played the requests as fast as I could. In normal operation throughout the day, each server doesn't have to deal with much more than an average of 12 requests per second. There will be the odd peak of double that, but they tend to be short-lived. There will be busier times of day where the average rate may be more like 16 requests per second, but not usually for more than a few hours. I was only using a single computer to generate load in this case but that was sufficient to create a sustained load of 35-40 requests per second. I figured that if the service would deal with this then we'd be doing great.
And for about forty minutes, things do go great. The server is busy, it's serving a lot (relative to a normal load) of requests, the gen 0 heap peaks and troughs the most, the gen 1 heap blips up and down with less drama, the gen 2 heap makes slower steps up then drops back down then very gently climbs then steps up then is steady then steps up slightly then drops down, carrying on merrily enough.
Until, at some point, the gen 2 heap starts curving up dramatically, followed by many steep steps upward, then a slightly pathetic dip immediately followed by steep steps upward. Having barely hit a gigabyte in size while gently building up and dropping earlier, it's now got to around 4 gig in a very short space of time. Here, it flatlines. During this steep climb, requests have gotten slower and slower and, at this flatline, they are no longer processed. This state continues for a couple of minutes, after which some work appears to attempt to continue, though the gen 2 heap doesn't drop in size at all. Some unusual errors are seen in the logs, such as:
Timeout expired. The timeout period elapsed prior to obtaining a connection from the pool.
It's as if, during this time, everything within the service stopped. This isn't a query timeout that occurred because the database server was taking too long, this error suggests that a SqlConnection was requested (with .NET pools internally) and then the world stopped for some time.. after which, the request-for-a-connection gave up since it had been so long since it asked for it.
I had thought that the point of the GC server mode was to avoid this sort of thing; even if a collection for one heap was taking a long time, each core has its own separate heap (and this server has four cores - it's not a real server, they're all virtualised, but that shouldn't make a huge difference). Could all of the heaps really have got jammed up simultaneously? Hopefully from everything I've written above, it's clear that there are a lot of subtleties to the intricate nature of the garbage collector and so it wouldn't surprise me if I'd not quite got the whole picture with server mode (or if I was maybe expecting a little too much!).
Incidentally, after this "flatline period", as requests appear to (slowly) begin being served again, the gen 2 heap grows to over 5 gig and then another flatline period is entered. This one much longer. So long, in fact, that I gave up waiting for it. Maybe my attention span is a bit short but I think that after more than five minutes of being completely stuck it's probably not much use even if the service does start going again.
The symptoms of this condition sound identical to the occasional "memory blow up" that was seen with the old version of the code on the live servers. It would seem that the changes so far had not provided a magic bullet.
Looking for clues
I wanted some insight into what was going on during these periods of apparent inactivity - well, it seemed like my code was inactive, though it appeared that the GC was being very busy. The "% time in GC" would spike around during work but seem to get more frenzied in its peaking in the lead up to the gen 2 heap size flat line, then it too would flat line in sympathy. After the first flat line period, it would remain higher but spike up and down, then it would flatline again when the gen 2 heap size flatlined, at a higher level than the time before.
I initially presumed that there must be something in the requests that caused this behaviour. So, if I skipped the first {whatever} thousand requests then I should be able to get this to happen sooner. Not so - skipping 10k requests still meant that I had to wait the same period of time for the behaviour to present itself. Skipping 20k, the same. If I skipped too many then the requests would complete without blowing up at all.
My next step was to try to use ANTS Memory Profiler and to take a couple of snapshots as the blowout started occurring. Unfortunately, by the time that the gen 2 heap size started climbing sharply, it would quickly get too big for the profiler to snapshot. There's a hard limit in the software as to how big of a memory dump it will try to process ("for performance reasons"). There's an option to take less information about each object so that larger dumps may be taken but even enabling that didn't work. In retrospect, it might have been worth reducing the memory in the virtual box and trying to reproduce the issue then - hopefully ANTS would have been able to deal with it then (everything got stuck when the gen 2 heap reached around four gig out of a total six gig of RAM, if the server only had four gig total then the available memory would be exhausted and the GC would presumably throw a tantrum much earlier, with a much smaller gen 2 heap).
After that I tried using PerfView since it's discussed and recommended in the "Writing High-Performance .NET Code" book. I managed to take a snapshot using that, freezing the process while doing so in order to prevent the heaps growing even more (taking the snapshot took almost two hours). When I loaded the dump file into PerfView to analyse, it appeared to show very little information about what types were in use (certainly it didn't appear to show the long list of types seen in all of the screenshots and tutorial videos about PerfView). There is a small row of information at the top of the heap alloc stack window that shows a summary. This showed 99% unreachable memory. This means that most of the memory is actually ready to be reclaimed by the collector (ie. that its roots are unreachable) and so I presumed that I wouldn't be able to find out much information about it. I tried finding confirmation for this online but didn't come up with much when searching for "99% unreachable memory PerfView". Another regret, looking back, is that I didn't try a bit harder to unearth information through PerfView. To be completely honest, though, I was losing patience.
Giving up and guessing (I prefer to call it intuition)
I was frustrated now. I was frustrated with what I was seeing, I was frustrated because I didn't understand precisely what triggered it and I was frustrated that I couldn't get any tools to tell me what was going awry. So I thought I'd just take a stab in the dark and see what happened.
In my defence, it was more sort of an educated guess. It seemed like what the service was asking of the garbage collector was something that the collector would (given enough time) decide it didn't like. I didn't feel like it was just allocation churn, my gut* told me that references that were very short lived were not the problem, even if there were a lot of them coming into existence and then disappearing again while the request rate was high. It felt like it was all going to lie with those ten / fifteen minute caches. If the GC likes references to live for very short periods of time or to live forever then this is the worst case for it. It's particularly bad since there may be many (ie. 1000s of) sets of cached results in memory at any time and each result set could potentially hold many stubs (again, 1000s).
* (I say "my gut told me" but I think that what that really means is that my sub-conscious, having been stuffed full with a million articles about garbage collection, was just regurgitating information I'd read..)
The logical step, then, would be to move that cache out of process. Maybe Redis, Memcached.. something like that. This would mean that any Query Cache lookup would involve a leap out to another process. Some sort of cache key would have to be generated, any results from the other process would have to be deserialised and then compared against the original search criteria (unless the cache was just a serialised representation of the entire search criteria then there would always be a change of cache key collision, so a single cache key might actually correspond to results from multiple different searches). This seemed like a lot of extra work, compared to just accessing cached references in memory.. but it's this just-get-bang-it-in-memory approach that has gotten me into trouble in the first place!
At this point, I was in no way certain that this would solve my problems and so thinking about setting up an external cache service was starting to feel like an exercise in yak shaving. So I went for the simplest alternative, I implemented a disk cache layer. If I was going to use an external cache then I'd still need a way to serialise the data that would need caching (so that I could send and receive it over the wire) and I'd still need a way to generate cache keys from the search criteria (by hashing the options, basically). I would also have to do that if I was going to just stash the cache values on disk. There would be a few minor complications with a disk cache rather than an off-the-shelf external cache (such as ensuring that old cache files are deleted if they're not accessed again within a reasonable amount of time) but most of the work to implement a disk cache would come in handy if the hypothesis was proved and a general purpose out-of-process cache for these ten-to-fifteen-minute cache items seemed to help.
(Just in case it's not completely obvious why a disk cache might work here, it's because the data isn't stored in memory for long periods of time any more - any time that the cached data is read from disk into memory, the in-memory representation only lives for the live of the request that the cached data is helping with - it then is free to be collected, meaning that it should never get out of gen 0).
So I changed the Query Cache so that it didn't maintain a ConcurrentDictionary of data (meaning, unfortunately, that the work I did for "Step 1" earlier was a waste of time) and, instead, had a simple ICache dependency injected into it. Simple in that it would only have options to read or write serialised data (as byte arrays) for particular keys - the deserialisation and subsequent checking of an "ExpiresAt" time would be handled within the Query Cache class. The ICache implementation read and wrote files on disk, mapping the cache keys onto file names and running a single background thread to tidy up old files that hadn't been touched for a while. Writing an alternative ICache implementation to talk to Redis would be very easy.
With this change, I was able to run the entire 100k request sample without issue. In fact, the service has been updated in production using this disk cache. While there are some challenges and compromises with a disk cache*, it's working well enough for now that we're going to leave it be. If it seems like, in the future, that the overhead of persisting to disk is a bottleneck and that a dedicated external cache could significantly improve the performance of individual requests or the overall throughput of the system, then we may change to using one. However, right now, that would just be one more moving part. The advantage of the disk cache is that it's very simple.
* (File IO of this type will suffer contention issues but this is a read-only service and so the worst case is that some database hits that could theoretically have been avoided are processed; if a request comes in whose results are not available in cache then it will get the data live and then try to write to a cache file - if another request comes in whose search criteria gets hashed to the same key then it won't be possible to read the data for that key while the writing from the first request is taking place)
In conclusion
It has now been a couple of weeks that this code has been in production. Over that time, all of the gen 0, 1 and 2 Small Object Heaps have appeared to breathe in and out in a healthy fashion, as has the Large Object Heap. There has been no indication of the slow-memory-usage-climb-to-oblivion that would be seen before.
The experience has been very interesting for me, it's given me a chance to expand my understanding of the garbage collector and to apply what I already knew about it. It would have been the icing on the cake to find out more about just what was happening in the process when it was having one of its "blow ups", but I'm more glad that it doesn't look likely to happen again than I am curious as to what was in its mind at the time! It's given me a fresh appreciation of the garbage collector and it's served as a reminder that it really is my buddy and not my enemy.
It's also gratifying that this service continues to get the love it needs to grow and develop. It doesn't seem to be particularly uncommon for code to be written that doesn't expect to be in use more than two years in the future (sometimes simply because a project is released and "handed off" to a client, never to be maintained or updated again - necessitating its replacement in the not-too-distant-future as the real world moves further and further away from what the original solution is able to do).
I wrote such a large portion of the service code myself that I have to bear the blame for the bad bits as well as the glory for the successes. Those custom non-locking-for-read-but-fully-cloning-for-write dictionaries (replaced in "Step 1" with the more modern ConcurrentDictionary) were my idea and implementation and seemed fanstastic at the time - but I'm not upset in the slightest to have seen the back of them now! It's a great opoortunity to look back over the years and see not only how technology has moved on since then but also my own knowledge. I very much intend to see it continuing!
Posted at 20:12