Part I: Little substring that could (not).
A few weeks ago a bug was filed against the Dart SDK citing “very low performance of String.substring”. Here is the core of the microbenchmark the developer submitted together with the issue:
The results are rather bad for the Dart version:
$ dart substring.dart
benchmarking with string of length 25000
substring(Dart): 244ms
benchmarking with string of length 50000
substring(Dart): 949ms
benchmarking with string of length 100000
substring(Dart): 3709ms
$ node substring.js
benchmarking with string of length 25000
substring(js): 2.566ms
benchmarking with string of length 50000
substring(js): 2.308ms
benchmarking with string of length 100000
substring(js): 2.633ms
Depending on your background you might also be surprised by a non-linear growth in Dart run times: increasing input string size by a factor of \(2\) (\(25000\) to \(50000\)) increases running time by a factor of \(4\) (\(244\) to \(949\) milliseconds).
Running the benchmark with dart --observe and looking at the CPU profile in Observatory reveals an unsurprising picture:
Most of the time is spent doing a substring operation, so why is Dart VM’s substring so much slower than V8’s? They must be implemented completely differently - and indeed they are.
Dart VM implements String.substring(start, end) in a very straightforward manner: it allocates a new String object of length end - start (the end index is exclusive) and copies the string contents into this new string. Formally speaking this is an \(O(n)\) implementation - also known as linear time - it requires a number of operations proportional to the length of the substring.
If we take a look at the core of our benchmark then it should become obvious why running times exhibit quadratic growth:
If we assume that the initial length of the string s is \(N\), then the first iteration of the loop creates a substring of length \(N - 1\), the next iteration creates a substring of length \(N - 2\) and so on, until the last iteration creates a substring of length \(1\). This means the loop requires
\[(N - 1) + (N - 2) + \dots + 1 = \frac{N(N - 1)}{2}\]
operations, which is proportional to \(N^2\).
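The arithmetic can be checked with a small sketch. It does not slice any real strings; it only adds up the lengths a copying substring implementation would have to copy, assuming the loop shape described above. The function name `copiedChars` is made up for this illustration.

```javascript
// Hypothetical helper: total number of characters copied when a loop
// repeatedly slices one character off the front of a string of length n
// and every substring call copies its result.
function copiedChars(n) {
  let copied = 0;
  // The remaining lengths are n - 1, n - 2, ..., 1.
  for (let len = n - 1; len >= 1; len--) {
    copied += len;
  }
  return copied;
}

for (const n of [25000, 50000, 100000]) {
  // The second and third columns agree: the sum equals n * (n - 1) / 2.
  console.log(n, copiedChars(n), (n * (n - 1)) / 2);
}
```

Doubling the input length roughly quadruples the count, which matches the growth seen in the Dart timings.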
[If you have never encountered this sequence before I suggest you stop and think for a bit about how the formula for its sum is derived. The solution’s simplicity and elegance might provide a welcome retreat from fighting Webpack configs. The Prince of Mathematics Carl Friedrich Gauss managed to figure it out when he was 8 - though obviously we will never know what he would have done facing the modern JavaScript ecosystem.]
The \(O(N^2)\) complexity of the above loop is precisely the reason why it’s not a good idea to iterate a string by slicing away processed pieces, unless of course your runtime does “a bit” of magic internally and optimizes for this particular pattern… unsurprisingly V8 does.
If we take a look into V8 sources we discover a somewhat intimidating zoo of different string representations, each optimizing for some particular use case (indexing, concatenation, slicing):
Whenever you see a string value in your JavaScript code, it can actually be backed by any of those representations. The runtime is able to operate on them interchangeably and can even switch from one representation to another dynamically if that improves the performance of some operation.
[Other JavaScript runtimes have similarly convoluted representation hierarchies spawned by benchmark races and the desire to optimize for common usage patterns. See SpiderMonkey’s String.h, JSC’s JSString.h and the *String.h files in Chakra’s sources.]
Understanding differences between different string representations used by different JS runtimes is usually a key to understanding why your string manipulation code performs in a certain way. For example we could digress and discuss why
runs 60 times faster in SpiderMonkey than in V8 - but unfortunately that’s beyond the scope of this post.
Let’s return to the String.prototype.substring microbenchmark in question. Cutting through the layers of C++ and assembly we would eventually arrive at the code implementing the substring operation:
[With V8 nothing is ever implemented in just a single place, so lost souls searching for the substring implementation might also find themselves digging into CodeStubAssembler::SubString, which constructs a TurboFan graph that is compiled down to machine code to serve as the SubString stub - which essentially implements fast paths for the C++ logic above.]
Translating the C++ logic to Dart we would get something like this:
$ dart substring.dart
benchmarking with string of length 25000
substring(Dart): 3ms
benchmarking with string of length 50000
substring(Dart): 3ms
benchmarking with string of length 100000
substring(Dart): 1ms
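The idea behind such a constant-time substring can be sketched in a few lines of JavaScript. This is a hypothetical illustration, not V8's or Dart's actual data structure: the class name `Slice` and its fields are invented here, and a real runtime also has to deal with non-flat parents, short substrings and encodings.

```javascript
// Hypothetical sketch of a "sliced string": a view onto a flat parent string.
class Slice {
  constructor(parent, offset, length) {
    this.parent = parent; // flat backing string, shared rather than copied
    this.offset = offset; // where this view starts inside the parent
    this.length = length; // number of characters in the view
  }

  // O(1): creates a new (parent, offset, length) triple, copies no characters.
  substring(start, end = this.length) {
    return new Slice(this.parent, this.offset + start, end - start);
  }

  charCodeAt(i) {
    return this.parent.charCodeAt(this.offset + i);
  }

  // Characters are only materialised when somebody asks for a flat string.
  toString() {
    return this.parent.substring(this.offset, this.offset + this.length);
  }
}

let view = new Slice('hello world', 0, 11);
while (view.length > 5) view = view.substring(1);
console.log(view.toString()); // prints "world"
```

Note that every `Slice` keeps a reference to its whole parent, which is the property that makes this representation both fast and risky.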
This is approximately what the JavaScript version is doing, and it explains why we don’t observe quadratic complexity on the microbenchmark: roughly speaking, each substring operation takes a constant amount of time (assuming that the input string is flat and disregarding memory management overhead) and thus instead of
\[(N - 1) + (N - 2) + \dots + 1 = O(N^2)\]
you get
\[\underbrace{1 + 1 + \dots + 1}_{N - 1} = O(N)\]
Given such a drastic performance improvement, why wouldn’t we implement a similar optimization in Dart VM? Well, this substring optimization comes with a dangerous trap built in. It leads to surprising memory leaks:
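A minimal sketch of the leaking pattern, under stated assumptions: `readHugeInput` and `firstToken` are made-up helpers, and the input here is about a megabyte rather than gigabytes so that the snippet stays runnable. Whether a particular engine really shares storage between the token and its source is an implementation detail that the code cannot observe; it only shows the shape of the problem.

```javascript
// Hypothetical stand-in for reading a very large input.
function readHugeInput() {
  return 'token-0123456789abcd' + 'x'.repeat(1 << 20);
}

// Extracts a 20 character token from the beginning of the input.
function firstToken(input) {
  return input.substring(0, 20);
}

// Only the small token is stored, but if the runtime represents it as a
// slice of the input, the whole input stays reachable through `obj`.
const obj = { token: firstToken(readHugeInput()) };
console.log(obj.token.length); // prints 20
```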
Counterintuitively obj above retains the whole 10Gb input string instead of a small 20 character token, because this token is internally represented as a SlicedString that points back to the source string. This might seem like a contrived example but leaks like this do tend to happen in the real world: for example three.js had to work around this issue. Eagerness of a runtime to fall over itself and make your code faster with clever hidden optimizations has an ugly side too. Issue 2869 tracks progress of fixing this on the V8 side, but nothing has really happened since 2013, probably because the only simple and robust solution is to remove sliced strings altogether. Interestingly that’s precisely what Java did: it used to implement String.substring in \(O(1)\) time by reusing the parent String’s char[] storage for the substring object, but that led to memory leaks and was eventually removed in 2012. V8’s history with string slices is even more curious: originally V8 had them, then removed them in 2009, then added them back in 2011.
All of this is somewhat bad news for the original Dart SDK bug which prompted this trip down memory lane. Indeed, Dart VM is unlikely to implement the substring optimization through sliced strings. That’s why I decided to probe: why exactly were they measuring substring performance?
Part II: Enter less_dart.
It turns out that the bug was prompted by their investigation into the slowness of less_dart, which is a port of less.js from JavaScript to Dart.
Looking at the less_dart benchmark in the Observatory reveals the following picture:
That is, the Less parser is doing exactly what I would not recommend: iterating through the input string using the substring operation. Looking into the source reveals the following code, and the reason for using substring becomes obvious:
[The code is ported word-for-word from JavaScript, so exactly the same code exists in the JavaScript version, where its, ahm, suboptimality is hidden by the helpful JavaScript runtime.]
The authors of Less decided to parse input using regular expressions instead of writing a custom lexer. However, in the pre-ES6 world lexing with RegExps was complicated by the fact that you could not take a regular expression and easily check whether it matches at a given position, which is what a lexer needs to do as it advances through the string and breaks it into tokens.
match can be easily implemented in any modern JavaScript interpreter that supports the sticky RegExp flag introduced in ES6:
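A minimal sketch of such a method, assuming a hypothetical `StickyLexer` class that keeps the input and the current position in `idx`. Callers are expected to pass regular expressions created with the `y` flag.

```javascript
// Hypothetical lexer skeleton: the class and field names are illustrative.
class StickyLexer {
  constructor(input) {
    this.input = input;
    this.idx = 0; // current position in the input
  }

  // `re` must carry the sticky flag, e.g. /\d+/y. A sticky regexp only
  // attempts a match exactly at `lastIndex` and never searches forward.
  match(re) {
    re.lastIndex = this.idx;
    const m = re.exec(this.input);
    if (m === null) return null;
    this.idx = re.lastIndex; // exec moved lastIndex past the match
    return m[0];
  }
}

const stickyLexer = new StickyLexer('12 ab');
console.log(stickyLexer.match(/\d+/y)); // prints "12"
console.log(stickyLexer.match(/\d+/y)); // prints null, position 2 is a space
```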
[The flag y is apparently named after yylex, which is part of the Lex API (not to be confused with Lexx, which is a spaceship based on organic technology that is far beyond even the things included in ES7). The name sticky in turn was apparently chosen because it ends with y.]
However, how would you implement match in a pre-ES6 JS engine? A very naive approach would be to do something like this:
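One way this naive approach could look, as a sketch with a hypothetical `GlobalLexer` class: the regexp carries the `g` flag, the search starts at `idx`, and a match is only accepted if it begins exactly there.

```javascript
// Hypothetical lexer skeleton using global regexps and lastIndex.
class GlobalLexer {
  constructor(input) {
    this.input = input;
    this.idx = 0; // current position in the input
  }

  // `re` must carry the global flag, e.g. /\d+/g, so that exec honours
  // lastIndex. Unlike a sticky regexp it is free to search forward.
  match(re) {
    re.lastIndex = this.idx;
    const m = re.exec(this.input);
    // A match further ahead in the input is useless to the lexer.
    if (m === null || m.index !== this.idx) return null;
    this.idx = re.lastIndex;
    return m[0];
  }
}

const globalLexer = new GlobalLexer('ab 12');
// exec scans ahead and finds "12" at index 3, which is then thrown away.
console.log(globalLexer.match(/\d+/g)); // prints null
console.log(globalLexer.match(/[a-z]+/g)); // prints "ab"
```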
This however is extremely inefficient because calling match(/\d+/g) would essentially search from this.idx forward for the first sequence of digits and then discard the match unless it occurred at this.idx.
A person with a bit more insight into RegExp features might come up with the following optimization:
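A sketch of one such optimization, which is an assumption about the approach rather than a quote of any particular implementation: append an alternative that always succeeds, so the regexp matches at the starting position one way or another and never scans ahead. The helper `irrefutable` and the class `IrrefutableLexer` are hypothetical names, and the helper ignores any flags on the original regexp.

```javascript
// Wraps a pattern as (?:pattern)|() so that a match attempt always
// succeeds at lastIndex: either the real pattern matches, or the empty
// sentinel group does.
function irrefutable(re) {
  return new RegExp('(?:' + re.source + ')|()', 'g');
}

class IrrefutableLexer {
  constructor(input) {
    this.input = input;
    this.idx = 0; // current position in the input
  }

  // `re` must have been produced by irrefutable().
  match(re) {
    re.lastIndex = this.idx;
    const m = re.exec(this.input);
    // The sentinel is the last group. If it participated in the match,
    // the real pattern failed at idx.
    if (m === null || m[m.length - 1] !== undefined) return null;
    this.idx += m[0].length;
    return m[0];
  }
}

const DIGITS = irrefutable(/\d+/);
const irrefutableLexer = new IrrefutableLexer('ab 12');
console.log(irrefutableLexer.match(DIGITS)); // prints null without scanning to "12"
```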
However, a much more common and straightforward way to implement match would be to use substring and anchored regexps:
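A sketch of this pattern, with a hypothetical `SubstringLexer` class that keeps only the unprocessed remainder of the input. It is not the less.js source, just the same shape: match with a `^`-anchored regexp, then slice the matched prefix away.

```javascript
// Hypothetical lexer skeleton using anchored regexps and substring.
class SubstringLexer {
  constructor(input) {
    this.rest = input; // the part of the input not yet consumed
  }

  // `re` must be anchored with ^, e.g. /^\d+/, so that it can only match
  // at the beginning of the remaining input.
  match(re) {
    const m = re.exec(this.rest);
    if (m === null) return null;
    // This is the slicing step whose cost depends on how the runtime
    // implements substring.
    this.rest = this.rest.substring(m[0].length);
    return m[0];
  }
}

const substringLexer = new SubstringLexer('12 ab');
console.log(substringLexer.match(/^\d+/)); // prints "12"
console.log(substringLexer.match(/^\d+/)); // prints null, rest is " ab"
```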
This is exactly the code we see in less.js and less_dart - fortunately for less.js the V8 runtime implements String.prototype.substring as an \(O(1)\) operation, unfortunately for less_dart Dart VM does not.
What Dart VM does implement is the RegExp.matchAsPrefix method, which essentially performs a sticky match of the given regular expression at the given position. Hooray! ☺
However, when I made the necessary changes to the less_dart code I discovered that it actually became several times slower. Hmm. ☹
It turns out there was a little TODO inside the VM’s RegExp implementation:
Obviously the solution was to fix the VM by implementing RegExp.matchAsPrefix in the same way V8 implements the sticky flag - which is rather simple because Dart VM uses essentially the same regexp engine as V8, called Irregexp.
[Maybe we should have called the Dart VM port IrIrRegexp, because instead of translating the RegExp specific intermediate representation (IR) down to machine code Dart VM translates it further into our machine independent IR and lets the generic compiler pipeline take care of it. This allowed us to incorporate Irregexp into Dart VM without porting any machine specific regexp related backend code.]
So I just went and talked to Erik Corry, who was one of the minds behind the original Irregexp, about his implementation of the sticky flag. Then I just ported it to Dart VM… and found and fixed a very minor bug in V8’s sticky implementation while doing that.
With RegExp.matchAsPrefix implemented in an efficient way, my fixes in less_dart gave the expected performance improvement, bringing less_dart and less.js neck and neck with each other.
Part III: The trap of RegExps.
I could have finished my post on this positive note, but I would like to circle back to the common idea of using regexps to tokenize the source code.
A very important thing to realize here is that while regular expressions are certainly a convenient way to lex, they are in no way the most efficient way to do it - there are simply too many abstraction layers involved.
If we take our previous attempts at tokenizing with regexps and benchmark them on a simple string like ("aaaaa aaaa ".repeat(50) + "10 ").repeat(10000) then we will see the following performance numbers:
# Naive tokenizer that just uses global regexps with lastIndex.
$ node tokenize-global.js
processed 2020001 tokens in 987ms

# Tokenizer that is using a global regexps with an irrefutable pattern
# to prevent searching for regexp match forward.
$ node tokenize-global-irrefutable.js
processed 2020001 tokens in 471ms

# Tokenizer that uses anchored regexps and substring to slice
# away the processed part of the string.
$ node tokenize-substring.js
processed 2020001 tokens in 393ms

# Tokenizer that uses sticky regexps.
$ node tokenize-sticky.js
processed 2020001 tokens in 295ms
Can you go any faster? Let’s erase abstraction barriers and go full manual:
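A sketch of what such a hand-written tokenizer might look like. The class name `ManualLexer`, the token type constants and the simplified token rules (runs of spaces, runs of digits, and everything else up to the next space) are assumptions made for this illustration. The token text is exposed through a lazy `val` getter, so no substring is allocated unless somebody asks for it.

```javascript
// Hypothetical token types.
const EOF = 0, WS = 1, NUM = 2, IDENT = 3;

class ManualLexer {
  constructor(input) {
    this.input = input;
    this.start = 0; // start of the current token
    this.end = 0;   // end of the current token (exclusive)
  }

  // Materialises the token text only on demand.
  get val() {
    return this.input.substring(this.start, this.end);
  }

  // Advances to the next token and returns its type.
  next() {
    const input = this.input;
    const n = input.length;
    let i = (this.start = this.end);
    if (i >= n) return EOF;

    const c = input.charCodeAt(i++);
    let type;
    if (c === 32) {
      // A run of spaces.
      while (i < n && input.charCodeAt(i) === 32) i++;
      type = WS;
    } else if (c >= 48 && c <= 57) {
      // A run of ASCII digits.
      while (i < n && input.charCodeAt(i) >= 48 && input.charCodeAt(i) <= 57) i++;
      type = NUM;
    } else {
      // Simplification: anything else up to the next space is an identifier.
      while (i < n && input.charCodeAt(i) !== 32) i++;
      type = IDENT;
    }
    this.end = i;
    return type;
  }
}

// Counts tokens, including the final EOF token.
function countTokens(input) {
  const l = new ManualLexer(input);
  let count = 1;
  while (l.next() !== EOF) count++;
  return count;
}

console.log(countTokens(('aaaaa aaaa '.repeat(50) + '10 ').repeat(10000)));
```

With these token rules the count for the benchmark string is 2020001: each repetition contributes 202 tokens, plus one EOF at the end.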
Benchmarking this tokenizer with the very same benchmark I used for others:
reveals the following result:
$ node tokenize-manual.js
processed 2020001 tokens in 50ms
The main thing this demonstrates is that your program can get much faster if it’s doing less work. For example, our manual tokenizer simply glides through the string without allocating any substring objects for tokens. However, if we add var val = l.val; inside the benchmark loop to force this meaningless allocation we will still see performance that is much better than that of our RegExp based parsers:
$ node tokenize-manual.js
processed 2020001 tokens in 98ms
For the sake of completeness here is the same kind of lexer benchmark in Dart:
Dart VM does a respectable job on this code:
$ dart tokenize.dart
processed 2020001 tokens in 67ms
So overall performance advice can be summarized as follows:
- When using RegExps for lexing use sticky ones in node and use RegExp.matchAsPrefix in Dart VM.
- Strongly consider writing your lexer by hand instead of using regular expressions.
- The less work your program is doing the faster it is. The fewer abstraction layers are involved the simpler it is to establish how much work your program is actually doing and to decrease that amount.