(This post was translated, the original version is in my Russian blog.)
This is a story about hunting a memory leak. A long story, because I go into much detail.
Why describe my adventures? Not just to save all those tiny code pieces and scripts. It rather occurred to me that what I had been following was the UNIX way: every step relied on yet another small utility, or a library, that solves its task well. And in the end I succeeded.
It was an interesting journey, too! Some days I went to bed and woke up thinking about what happens to the memory.
I am grateful to my colleagues at Shuttlerock who did the actual work (as in "being productive for customers") while I was playing memory detective. Thanks to the company itself, too!
So we found ourselves in an unpleasant situation: our Rails app on the Heroku platform had a memory leak. You can see it on this graph:
Now, every deploy or restart makes the memory line go up fast (see around 3 AM and 8 AM). That's totally normal, because Ruby is a dynamic language: some requires are still being fired, something is loading up. When everything is loaded, memory consumption should be quasi-constant. Requests come in, objects get created, and the line goes up; the garbage collector does its job, and it goes down. The app should be able to stay in such a dynamic balance forever.
But in our case there’s an apparent upward trend. A leak! When our Heroku 2X dyno (we use puma in cluster mode with 2 worker processes) reaches 1 GB, it becomes slow as hell, unresponsive and needs to be restarted.
Thou shalt not experiment on the production server. The first thing I did was to download the latest production database dump and run the app in the production environment on my local Linux server, which resides in the closet (you should have one too!). Our app is a SaaS product and has a special middleware which detects customer sites by domain, so I had to introduce small changes to the app code to let me make requests like curl http://localhost:3000/…. Environment variables are best for this purpose.
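Here's a sketch of the kind of change involved (the Site lookup and the env key here are illustrative; the real middleware is more involved):

```ruby
# Simplified middleware sketch: normally the site is resolved by domain,
# but with QUERY_SITE_ID set it can be forced via a query parameter.
class SiteDetector
  def initialize(app)
    @app = app
  end

  def call(env)
    request = Rack::Request.new(env)
    site =
      if ENV['QUERY_SITE_ID'] && request.params['site_id']
        Site.find(request.params['site_id'])
      else
        Site.find_by!(domain: request.host)
      end
    env['app.current_site'] = site
    @app.call(env)
  end
end
```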
As you can see, if the QUERY_SITE_ID environment variable is set, the site ID gets taken from the site_id query parameter:
curl http://localhost:3000/?site_id=123
In most cases you'll also have to specify config.force_ssl = false in config/environments/production.rb, set the DEVISE_SECRET_KEY variable (if you use devise), and maybe something else. In the end, the curl command must just work.
So, the server works locally; what's next? You have to supply incoming requests. The wonderful siege utility lets you put load on a web server in different modes and records statistics.
I decided not to hit a single route, but rather to collect real URLs used by web clients.
Easy-peasy: run heroku logs -t | tee log/production.log for some time, then extract the URLs from the log. I wrote a small utility which parsed the log, collected paths and the site_id values reported by our middleware, and saved everything into urls.txt, which looks roughly like this:
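A couple of made-up lines to show the format (one full URL per line, with the site_id appended as a query parameter, which is exactly what siege expects in its urls file):

```
http://localhost:3000/?site_id=123
http://localhost:3000/posts/4567?site_id=123
http://localhost:3000/?site_id=321
```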
You can create such a file by hand, or resort to awk, grep, sed.
Let’s run siege:
siege -v -c 15 --log=/tmp/siege.log -f urls.txt
Here siege will create 15 parallel clients and use urls.txt as its input.
If you do everything right, you'll experience the memory leak. It can be seen with the top and ps utilities — the number to look for is called RSS (Resident Set Size). To save myself from running them all the time, though, I added the following code to the app:
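It boiled down to periodically logging the process RSS together with GC.stat[:heap_live_slots]; something along these lines (RSS is read via ps here, and the interval is arbitrary):

```ruby
# config/initializers/memory_log.rb (sketch)
Thread.new do
  loop do
    rss_kb = `ps -o rss= -p #{Process.pid}`.to_i # resident set size, in KB
    Rails.logger.info(
      "MEMORY rss=#{rss_kb}KB live_slots=#{GC.stat[:heap_live_slots]}"
    )
    sleep 5
  end
end
```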
Neat records with an ever-increasing RSS started to appear in the log. More on GC.stat[:heap_live_slots] later.
After the first experiments I switched puma to single mode, because it leaked too, and a single process is easier to deal with.
Now certain that the leak existed, I started playing the Sherlock Holmes part.
Some words need to be said about how MRI works with memory in general.
Objects are stored in a heap controlled by the interpreter. The heap consists of separate pages, each 16 KB in size, with every object slot taking 40 bytes of memory. When an object is created, MRI searches for a free slot; if there is none, an extra page is added to the heap. Not every object fits into 40 bytes, though: if it needs more, additional memory is allocated (via malloc).
Memory is freed automatically when the garbage collector (GC) runs. Modern MRIs have a quite effective incremental GC with generations and two phases: minor and major. Based on the heuristic "most objects die young", a minor GC tries to find unneeded objects only among the newly created ones. This allows the major GC to run less often; that one performs the classic mark-and-sweep algorithm over ALL objects.
It must be noted that the intricacies of the different generations have nothing to do with hunting memory leaks. There's only one important question: are all the objects created while processing a web request eventually freed, or not? In the context of a web server, all objects can be divided into three groups:
- Statics. All loaded gems, especially Rails, and the app code. All of this gets loaded once in the production environment and doesn't change.
- Slow dynamics. There's a certain amount of long-lived objects, e.g. the cache of prepared SQL statements in ActiveRecord. This cache is per DB connection and holds at most 1000 statements by default. It grows slowly, and the total number of objects increases until the cache reaches full size (2000 strings * number of DB connections).
- Fast dynamics. Objects created during request processing and response generation. When the response is ready, all these objects can be freed.
In the third case, if any object doesn't get freed, you'll have a leak. Here's an illustration:
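A minimal example along these lines (the constant and the string are, of course, just placeholders):

```ruby
class MyController < ApplicationController
  FOO = [] # a constant, never garbage-collected

  def index
    # every request appends more strings to the constant,
    # so the heap grows with each call
    100.times { FOO << "haha" }
    render plain: "ok"
  end
end
```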
Constants are not garbage-collected, therefore consecutive MyController#index calls will inflate FOO. The heap will grow to accommodate more and more "haha" strings.
If there are no leaks, the heap size will oscillate. The minimum size corresponds to objects from groups 1 and 2 (see above). For example, in our app this size is slightly above 500,000, while an empty app created with rails new app instantiates roughly 300,000 objects. The maximum heap size depends on how often major GC runs. But after every major GC the number of objects always comes back to that lower bound. A leak makes the lower bound increase over time. The GC.stat[:heap_live_slots] figure reflects the current heap size.
The most convenient way to explore GC-related things is the gc_tracer gem, built by Koichi Sasada, a member of the Ruby developer team and the author of the incremental GC implementation in Ruby 2.1 and 2.2. Having added a GC::Tracer call to config/application.rb (sketched below), we get a log/gc.log file filled with GC stats and also getrusage results (the latter is useful because one of its fields contains the RSS figure we are so interested in!).
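Roughly like this (rusage logging is off by default and has to be enabled; see the gc_tracer README for the exact set of options):

```ruby
# config/application.rb (sketch)
if ENV['GC_TRACER']
  require 'gc_tracer'
  # Log GC stats plus getrusage data (which includes the max RSS) on GC events.
  GC::Tracer.start_logging('log/gc.log', rusage: true)
end
```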
Every line of this log has over 50 numbers, but here's some simple UNIX magic to the rescue. The command I ran along with puma and siege was:
tail -f log/gc.log | cut -f 2,4,7,8,9,15,16,19,21,22,25,26,27,28,29,30,36
which gives this:
The first number in each line is a timestamp in milliseconds. The second is the number of pages. The third is the heap size (in objects). The 11th (with values in the 581634…613199 range) is the number of "old" objects, i.e. objects which are not inspected during a minor GC run. The last number in each line is RSS in kilobytes.
Still so many numbers! Let’s plot them. We could load this log directly into Excel (sorry, LibreOffice Calc), but that’s not classy. Let’s use gnuplot instead, which is able to plot directly from files.
Unfortunately, gnuplot doesn’t support timestamps in milliseconds, so I had to write a small script for timestamp conversion:
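A sketch of that filter (assuming, as in the raw gc_tracer output, that the event type is in the first column and the millisecond timestamp in the second):

```ruby
#!/usr/bin/env ruby
# Keep only finished GC cycles ("end_sweep") and convert the millisecond
# timestamp in the second column to seconds, so gnuplot can digest it.
ARGF.each_line do |line|
  fields = line.chomp.split("\t")
  next unless fields[0] == 'end_sweep'
  fields[1] = format('%.3f', fields[1].to_f / 1000)
  puts fields.join("\t")
end
```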
Along with turning milliseconds into seconds, some extra information is discarded here: gc_tracer generates data for all stages of GC, but we are only interested in the final data (lines whose first column contains "end_sweep").
This gnuplot script gets us this picture:
The red curve (left scale) displays the number of "old" objects; it behaves as if there were no leak. The blue curve (right scale) is RSS, which never stops rising.
The conclusion: there are no leaks in the Ruby code. But I wasn't perceptive enough to grasp that at the time and spent more time pursuing a false path.
Modern versions of MRI are equipped with powerful means for memory analysis. For instance, you can enable object allocation tracing, and for each newly created object the location where it was instantiated (source file and line) will be saved and accessible later:
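For example (the reported location is whatever file and line actually created the object):

```ruby
require 'objspace'

# From now on, remember the file and line where every new object was created.
ObjectSpace.trace_object_allocations_start

obj = Object.new
ObjectSpace.allocation_sourcefile(obj) # => "path/to/file.rb"
ObjectSpace.allocation_sourceline(obj) # => 42
```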
Now we can dump the heap into a file:
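The file name here is arbitrary:

```ruby
require 'objspace'

# Dump every live object on the Ruby heap, one JSON document per line.
File.open('tmp/heap.json', 'w') do |f|
  ObjectSpace.dump_all(output: f)
end
```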
Location information will be dumped too, if it’s available. The dump itself is a JSON Lines file: data for each object are in JSON format and occupy one line in the file.
There are gems which make use of allocation tracing, like memory_profiler and derailed. I decided to investigate what happens with the heap in our app as well. Having had no luck with memory_profiler, I went for generating and analyzing dumps myself.
First I wrote my own benchmark like derailed does, but later switched to generating dumps from the live app, using rbtrace. If you have gem 'rbtrace' in your Gemfile, here's a way to generate a dump:
Now let’s assume we have three dumps (1, 2, 3), generated at different moments in time. How can we spot a leak? The following scheme has been suggested: let’s take dump 2 and remove all objects which are present in dump 1. Then let’s also remove all objects which are missing from dump 3. What’s left is objects which were possibly leaked during the time between dump 1 and 2 creation.
I even wrote my own utility for this differential analysis procedure. It’s implemented in… Clojure, because I like Clojure.
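In Ruby terms, the idea boils down to something like this (heap dump entries are matched by their address field; the file names are arbitrary):

```ruby
require 'json'
require 'set'

# Collect the addresses of all objects present in a heap dump.
def addresses(path)
  File.foreach(path).each_with_object(Set.new) do |line, set|
    addr = JSON.parse(line)['address']
    set << addr if addr
  end
end

in_dump1 = addresses('heap-1.json')
in_dump3 = addresses('heap-3.json')

# Keep objects from dump 2 that were absent from dump 1
# but are still present in dump 3: the leak candidates.
File.foreach('heap-2.json') do |line|
  addr = JSON.parse(line)['address']
  next unless addr
  puts line if !in_dump1.include?(addr) && in_dump3.include?(addr)
end
```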
But everything I could find was the prepared SQL statements cache mentioned above. Its contents, SQL strings, weren’t leaked, they just lived long enough to appear in dump 3.
Finally I had to admit that there were no leaks in the Ruby code, and went looking for them elsewhere.
I made the following hypothesis: if there are no leaks on the Ruby side, but memory is still leaking, there must be a leak in C code. It could be either some gem's native code or MRI itself. In that case the C heap must be growing.
But how to detect leaks in C code? I tried valgrind on Linux and leaks on OS X. They didn't get me anything interesting. Still looking around, I stumbled upon jemalloc.
Jemalloc is a custom implementation of malloc, free, and realloc that tries to be more effective than the standard system implementation (except on FreeBSD, where jemalloc is the system implementation). It uses a bag full of tricks to achieve this goal. It has its own page system, with pages allocated via mmap; the allocator uses independent "arenas" with thread affinity, allowing for less synchronization between threads. Allocated blocks are divided into three size classes: small (< 3584 bytes), large (< 4 MB), and huge; each class is handled differently. But, most importantly, jemalloc has statistics and profiling. Finally, MRI 2.2.0 supports jemalloc natively!
The LD_PRELOAD hack is not needed anymore (by the way, I couldn't make it work).
I quickly set about installing jemalloc and then MRI with jemalloc enabled. Not without trouble: the jemalloc library provided by Ubuntu and Homebrew is built without profiling. Also, you can't build certain Ruby gems with the most recent jemalloc version, 4.0.0, released in August 2015: e.g. the pg gem doesn't like the C99 <stdbool.h> included by <jemalloc/jemalloc.h>. But everything is smooth with jemalloc 3.6.0.
You can follow these instructions to build MRI with rbenv; in the end, however, I opted for building it myself:
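Roughly (the prefix is up to you; jemalloc 3.6.0 has to be installed where the compiler can find it):

```
./configure --prefix=$HOME/.rubies/ruby-2.2-jemalloc --with-jemalloc
make && make install
```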
You’ll also need --with-openssl-dir=/usr/local/opt/openssl
for Homebrew.
This Ruby works as normal, but it reacts to the MALLOC_CONF environment variable:
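For example, to make the allocator print its statistics when the process exits (the puma invocation is whatever you normally use):

```
MALLOC_CONF=stats_print:true bundle exec puma -C config/puma.rb
```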
When Ruby exits, you get quite detailed memory allocation stats printed to stderr. I used this setting to run puma overnight (with siege attacking it). By morning RSS had grown to 2 GB. Hitting Ctrl-C brought me to the stats, with the allocated memory figure being close to the same 2 GB. Eureka!
The C code leak hypothesis has been confirmed!
The next question was: where in the C code are those 2 GB allocated? Profiling helped here. The jemalloc profiler stores the addresses which called malloc, calculates how much memory has been allocated from each address, and writes everything to a dump. You can enable it with the same MALLOC_CONF, by specifying the prof:true flag; in this case the final dump is generated when the process exits. The dump is then analyzed with the pprof program.
Unfortunately, pprof couldn't decode the addresses:
I had to subtract the start of the code segment (this information is printed when the process exits) from these numbers and use the info symbol command in gdb (e.g. info symbol 0x2b5f9). The address turned out to belong to the objspace_xmalloc function (it's declared static, maybe that's why it wasn't shown). A more representative profile, with puma being hit by siege for 2 hours, showed that this function accounted for 97.9 % of all memory allocated. So the leak had something to do with Ruby after all!
Having narrowed down the search area, I decided to investigate the statistical patterns of the allocated blocks. Not feeling inclined to parse jemalloc's textual stats output, I wrote my own gem, jemal. Its main functionality is in the Jemal.stats method, which returns all statistics of interest as one big hash.
All that was left was to add a small piece of code to the app:
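Something along these lines (the interval and log path are arbitrary; the gist is dumping Jemal.stats as one JSON line per sample, with a ts timestamp):

```ruby
# config/initializers/jemalloc_stats.rb (sketch)
if ENV['JEMALLOC_STATS']
  require 'jemal'
  require 'json'

  Thread.new do
    File.open(Rails.root.join('log', 'jemalloc.log'), 'a') do |log|
      loop do
        sample = Jemal.stats.merge(ts: (Time.now.to_f * 1000).to_i)
        log.puts(sample.to_json)
        log.flush
        sleep 10
      end
    end
  end
end
```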
…and run puma and siege overnight again.
By morning, log/jemalloc.log was big enough to be analyzed. The jq tool proved extremely helpful. First I decided to see how memory grows:
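The pipeline was along these lines (the exact key path depends on how the jemal hash is laid out):

```
jq '.ts, .allocated' log/jemalloc.log | paste - - > allocated.tsv
```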
Look how the UNIX way works here! jq parses the JSON in every line and outputs the ts and allocated values in turn, each on its own line. Then paste - - joins its input line by line, separating the two values with a TAB. Such a file can be fed to gnuplot:
Linear growth! What’s with block sizes?
jq
can extract data even from deeply embedded structures.
The plot testifies that it's small objects that are being leaked. But that's not all: jemalloc provides separate stats for different block sizes within one class! In the small class, every allocated block has one of 28 fixed sizes (the requested size is simply rounded up): 8, 16, 32, 48, 64, 80, …, 256, 320, 384, …, 3584. Stats for every size are kept separately. By staring at the log I noticed an anomaly with size 320. Let's plot it as well:
Wow! Memory is consumed by objects of one size. Everything else is essentially constant, as is evident from the fact that the lines are parallel. But what's up with size 320?
Apart from the total memory allocated, jemalloc calculates 8 more indices, including the number of allocations (basically, malloc calls) and deallocations. Let's plot them too:
The same indices for blocks of the neighboring size, 256, are included for comparison. It's apparent that the blue and orange curves blend together, which means the number of deallocations for 256-byte blocks is approximately equal to the number of allocations (that's what we call healthy!). Compare this to size 320, where the number of allocations (magenta curve) runs away from the number of deallocations (green curve). Which finally proves the existence of the memory leak.
We’ve squeezed everything we could from stats. Now the leak source is to be found.
I had no better idea than adding some debug prints to gc.c:
Not wanting to drown in information, I started running curl by hand instead of using siege. I was interested in how many blocks get leaked during the processing of a single request. Yet another middleware was injected into the app:
I’ve used initializer (not config/application.rb
) here to ensure that the middleware
is exactly at the top of middleware stack.
Having run curl a few times, I saw that the M-D value increased by 80-90 with every request. Allocations and deallocations also appeared on stderr (and in the log, by virtue of tee). Cutting the last portion of the log between ------------------------- and MAM BEFORE ..., I ran it through this simple script:
Now here they are, the possible leak addresses:
There are quite a lot of blocks of size 312. What does that mean? Nothing! I wanted to look at the memory contents and used gdb to connect to the live process. So, taking some address and looking at what's there:
Looks like a bunch of pointers. Intel is little-endian, so the most significant byte comes last, and the first line represents the number 0x7f1139dee134 (the 0x7f113.. prefix made me believe it was an address). Helpful? Not much.
Then I wanted to see the backtraces of the calls which allocated those blocks. Lazy googling revealed the following code, which works with gcc:
I’ve added the call of this function to objspace_xmalloc
:
Then I repeated everything again, ran the possible-leak detection script, and started looking up the found addresses in the log, where backtraces were neatly printed next to them.
And what I saw was…
More than a dozen addresses related to redcarpet.so showed up. Now that's who ate all our memory!!! It was the redcarpet gem, a Markdown-to-HTML renderer.
It was easy after finding the offender. I ran 10,000 renderings in the console: leak confirmed. I ditched Rails, loaded the gem on its own and repeated the same: the leak was still there!
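The standalone experiment was roughly this (the Markdown input doesn't matter; what matters is that a new render object is created on every iteration):

```ruby
require 'redcarpet'

before = `ps -o rss= -p #{Process.pid}`.to_i
10_000.times do
  # Each Redcarpet::Markdown.new allocates a fresh render object
  # (Redcarpet::Render::HTML inherits from Redcarpet::Render::Base).
  markdown = Redcarpet::Markdown.new(Redcarpet::Render::HTML)
  markdown.render("# hello\n\nsome *markdown* text")
end
GC.start
after = `ps -o rss= -p #{Process.pid}`.to_i
puts "RSS grew by #{after - before} KB"
```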
The only place where redcarpet's native code allocates memory through the Ruby interface is the rb_redcarpet_rbase_alloc function, which is basically a C constructor for Redcarpet::Render::Base. The allocated memory wasn't freed during garbage collection. Quick googling revealed an example of how to write such a constructor correctly in tenderlove's blog.
And the fix was simple.
Bingo!
- Probably, it shouldn't have taken two weeks. I'd spend much less time now.
- False paths take you off course, and you have to find your way back. Aside from the dumps, I also spent some time trying to tune the garbage collector with environment variables. The only thing I took away was RUBY_GC_HEAP_INIT_SLOTS=1000000: with this setting our app fits into the initial heap and doesn't need more slots.
- It seems you can debug anything these days. The number of helpful tools and libraries is incredible. If you don't succeed, just keep trying.
The redcarpet maintainers have done nothing about my pull request so far (as of Sep 21, 2015), but we deployed the fix to production long ago. Compare the leaky picture from the beginning of this post with the normal memory consumption graph:
Here’s no trend at all. As it should be.
Update on 2015/09/29: the redcarpet fix has been released.