Sometimes programmers want to prune out characters from a string of characters. For example, maybe you want to remove all line-ending characters from a piece of text.
Let me consider the problem where I want to remove all spaces (' ') and line-ending characters ('\n' and '\r').
How would you do it efficiently?
    size_t despace(char *bytes, size_t howmany) {
      size_t pos = 0;
      for (size_t i = 0; i < howmany; i++) {
        char c = bytes[i];
        if (c == '\r' || c == '\n' || c == ' ') {
          continue;
        }
        bytes[pos++] = c;
      }
      return pos;
    }
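This code will work on all UTF-8 encoded strings, because every byte of a multi-byte UTF-8 character has its high bit set and so can never be mistaken for an ASCII space or line ending. That covers the bulk of the strings found on the Internet if you consider that UTF-8 is a superset of ASCII.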
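To make the UTF-8 point concrete, here is a small usage sketch. The wrapper `despace_string` and the sample text are my own, chosen for illustration; the scalar routine is repeated so the example stands alone.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Same scalar routine as in the post, repeated so the example is self-contained. */
size_t despace(char *bytes, size_t howmany) {
  size_t pos = 0;
  for (size_t i = 0; i < howmany; i++) {
    char c = bytes[i];
    if (c == '\r' || c == '\n' || c == ' ') {
      continue; /* drop white space */
    }
    bytes[pos++] = c; /* keep everything else, packed toward the front */
  }
  return pos;
}

/* Hypothetical convenience wrapper (not from the post): despace a
   NUL-terminated string in place, re-terminate it, return the new length. */
size_t despace_string(char *s) {
  assert(s != NULL);
  size_t n = despace(s, strlen(s));
  s[n] = '\0';
  return n;
}
```

Calling `despace_string` on a buffer holding "café au\r\nlait " (with the é stored as the two UTF-8 bytes 0xC3 0xA9) leaves the 11 bytes of "caféaulait": the two bytes of the é pass through untouched because neither equals a space or a line ending.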
That’s simple and should be fast… I had a blast looking at how various compilers process this code. It ends up being a handful of instructions per processed byte.
But we are processing bytes one by one while our processors have a 64-bit architecture. Can we process the data in units of 64-bit words?
There is a somewhat mysterious bit-twiddling expression, which I will call haszero, that is nonzero (true) whenever your 64-bit word contains a zero byte:
    #define haszero(v) (((v) - UINT64_C(0x0101010101010101)) & ~(v) & UINT64_C(0x8080808080808080))
All we need to know is that it works. With this tool, we can write a faster function…
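For the curious, here is a rough account of why it works. Subtracting 0x01 from every byte turns a zero byte into 0xFF, which has its high bit set. The `& ~(v)` term discards bytes whose high bit was already set before the subtraction, and the final mask keeps only the high bits. A borrow can spill from a zero byte into the bytes above it, so higher bytes may be flagged spuriously, but the result is nonzero if and only if some byte is zero. To look for a particular character, XOR the word with that character copied into all eight bytes, which turns matching bytes into zero. The helpers `broadcast` and `has_byte` below are names I made up for this sketch.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Nonzero if and only if at least one byte of the 64-bit word v is zero. */
#define haszero(v) \
  (((v) - UINT64_C(0x0101010101010101)) & ~(v) & UINT64_C(0x8080808080808080))

/* Hypothetical helper: copy the byte c into all eight bytes of a word.
   ~0 / 255 is 0x0101010101010101, so the product repeats c eight times. */
uint64_t broadcast(unsigned char c) {
  return ~UINT64_C(0) / 255 * (uint64_t)c;
}

/* Hypothetical helper: nonzero if any of the eight bytes at p equals c. */
uint64_t has_byte(const char *p, unsigned char c) {
  assert(p != NULL);
  uint64_t word;
  memcpy(&word, p, sizeof(word));      /* safe unaligned load */
  return haszero(word ^ broadcast(c)); /* bytes equal to c become zero */
}
```

Because the test only answers "is there a match somewhere in these eight bytes?", it is best used as a cheap filter in front of a slower exact path.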
    size_t i = 0, pos = 0;
    uint64_t word;
    uint64_t mask1 = ~UINT64_C(0) / 255 * (uint64_t)('\r');
    uint64_t mask2 = ~UINT64_C(0) / 255 * (uint64_t)('\n');
    uint64_t mask3 = ~UINT64_C(0) / 255 * (uint64_t)(' ');
    for (; i + 7 < howmany; i += 8) {
      memcpy(&word, bytes + i, sizeof(word));
      uint64_t xor1 = word ^ mask1;
      uint64_t xor2 = word ^ mask2;
      uint64_t xor3 = word ^ mask3;
      if (haszero(xor1) | haszero(xor2) | haszero(xor3)) {
        // check each of the eight bytes by hand?
      } else {
        memmove(bytes + pos, bytes + i, sizeof(word));
        pos += 8;
      }
    }
It is going to be faster as long as most blocks of eight characters do not contain any white space. In that case, we are basically copying 64-bit words one after the other, with a moderately expensive check on each word that our superscalar processors can do quickly.
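Here is one way the pieces might fit together into a complete function. This is a sketch under my own assumptions: the name `despace64` is made up, the slow path simply falls back to the byte-by-byte loop (my reading of "by hand"), and a final loop handles the last few bytes that do not fill a whole word.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define haszero(v) \
  (((v) - UINT64_C(0x0101010101010101)) & ~(v) & UINT64_C(0x8080808080808080))

/* Hypothetical name: removes ' ', '\n' and '\r' in place, returns new length. */
size_t despace64(char *bytes, size_t howmany) {
  assert(bytes != NULL || howmany == 0);
  const uint64_t mask1 = ~UINT64_C(0) / 255 * (uint64_t)('\r');
  const uint64_t mask2 = ~UINT64_C(0) / 255 * (uint64_t)('\n');
  const uint64_t mask3 = ~UINT64_C(0) / 255 * (uint64_t)(' ');
  size_t pos = 0;
  size_t i = 0;
  uint64_t word;
  for (; i + 7 < howmany; i += 8) {
    memcpy(&word, bytes + i, sizeof(word)); /* load eight bytes */
    if (haszero(word ^ mask1) | haszero(word ^ mask2) | haszero(word ^ mask3)) {
      /* Slow path: some byte is white space, so filter this block by hand. */
      for (size_t j = i; j < i + 8; j++) {
        char c = bytes[j];
        if (c != '\r' && c != '\n' && c != ' ') {
          bytes[pos++] = c;
        }
      }
    } else {
      /* Fast path: no white space, store the word we already loaded. */
      memcpy(bytes + pos, &word, sizeof(word));
      pos += 8;
    }
  }
  /* Tail: fewer than eight bytes remain. */
  for (; i < howmany; i++) {
    char c = bytes[i];
    if (c != '\r' && c != '\n' && c != ' ') {
      bytes[pos++] = c;
    }
  }
  return pos;
}
```

Since `pos` never gets ahead of `i`, writing at `bytes + pos` cannot clobber data that has not been read yet.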
Can we do better? Sure! Ever since the Pentium 4 (released in late 2000), we have had 128-bit integer SIMD instructions (SSE2).
Let us solve the same problem with these nifty 128-bit SSE instructions, using the (ugly?) Intel intrinsics…
    __m128i spaces = _mm_set1_epi8(' ');
    __m128i newline = _mm_set1_epi8('\n');
    __m128i carriage = _mm_set1_epi8('\r');
    for (; i + 15 < howmany; i += 16) {
      __m128i x = _mm_loadu_si128((const __m128i *)(bytes + i));
      __m128i xspaces = _mm_cmpeq_epi8(x, spaces);
      __m128i xnewline = _mm_cmpeq_epi8(x, newline);
      __m128i xcarriage = _mm_cmpeq_epi8(x, carriage);
      __m128i anywhite = _mm_or_si128(_mm_or_si128(xspaces, xnewline), xcarriage);
      int mask16 = _mm_movemask_epi8(anywhite); // contains 16 bits, 1 = is white
      if (mask16 == 0) { // no match!
        _mm_storeu_si128((__m128i *)(bytes + pos), x); // just recopy
        pos += 16;
      } else { // we need to permute the bytes
        // despace_mask16 is a precomputed table (not shown): one shuffle mask per mask16 value
        x = _mm_shuffle_epi8(x, _mm_loadu_si128((const __m128i *)despace_mask16 + mask16));
        _mm_storeu_si128((__m128i *)(bytes + pos), x);
        pos += 16 - _mm_popcnt_u32(mask16); // popcount!
      }
    }
The code is fairly straightforward if you are familiar with SIMD instructions on Intel processors. Note that _mm_shuffle_epi8 requires SSSE3 and _mm_popcnt_u32 requires the POPCNT instruction, so plain SSE2 is not enough. I have made no effort to optimize it… so it is possible, even likely, that we could make it run faster. Unrolling the loop is a likely candidate for optimization.
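The loop relies on the `despace_mask16` table, which is not shown here. As a sketch of how one entry of such a table could be computed (the function name `build_shuffle_mask` and the choice of padding are my assumptions, not necessarily how the real table was generated): list the indices of the bytes to keep, in order, then pad the rest. Padding with 0x80 works because `_mm_shuffle_epi8` writes a zero byte wherever the control byte has its high bit set.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: fill out[0..15] with the shuffle control bytes for one
   value of mask16, where bit j set means byte j is white space and is dropped. */
void build_shuffle_mask(unsigned int mask16, uint8_t out[16]) {
  assert(mask16 <= 0xFFFF);
  int k = 0;
  for (int j = 0; j < 16; j++) {
    if ((mask16 & (1u << j)) == 0) {
      out[k++] = (uint8_t)j; /* keep byte j, packed toward the front */
    }
  }
  while (k < 16) {
    out[k++] = 0x80; /* high bit set: the shuffle writes a zero byte here */
  }
}
```

Running this for every value from 0 to 65535 gives 65,536 entries of 16 bytes each, so the full table takes 1 MiB. The padding bytes end up just past the new `pos`, where later stores overwrite them or the caller ignores them.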
Let us see how fast it runs as is!
I designed a benchmark using a recent (Skylake) Intel processor over text entries where only a few characters are white space.
regular code | 5.85 cycles / byte |
using 64-bit words | 2.56 cycles / byte |
SIMD (128 bits) code | 0.80 cycles / byte |
memcpy | 0.08 cycles / byte |
So the vectorized code is over seven times faster than the regular code. That’s pretty good. I am using 128-bit registers, so I load and save blocks of 16 bytes. It would be foolish to expect to go 16 times faster, but I was hoping to be 8 times faster… being 7 times faster is close enough.
Yet pruning a few spaces is 10 times slower than copying the data with memcpy. So maybe we can go even faster. How fast could we be?
One hint: our Intel processors can actually process 256-bit registers (with AVX/AVX2 instructions), so it is possible I could go twice as fast. Sadly, many 256-bit SIMD instructions on x64 processors operate on two independent 128-bit lanes, which makes algorithmic design more painful.
My approach using 64-bit words is a bit disappointing, as it is only twice as fast… but it has the benefit of being entirely portable… and I am sure a dedicated programmer could make it even faster.