Channel: Hacker News

How quickly can you remove spaces from a string?


Sometimes programmers want to prune out characters from a string of characters. For example, maybe you want to remove all line-ending characters from a piece of text.

Let me consider the problem where I want to remove all spaces (' ') and line-ending characters ('\n' and '\r').

How would you do it efficiently?

size_t despace(char *bytes, size_t howmany) {
  size_t pos = 0;
  for (size_t i = 0; i < howmany; i++) {
    char c = bytes[i];
    if (c == '\r' || c == '\n' || c == ' ') {
      continue;
    }
    bytes[pos++] = c;
  }
  return pos;
}

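This code will work on all UTF-8 encoded strings… which is the bulk of the strings found on the Internet if you consider that UTF-8 is a superset of ASCII. It works because every byte of a multi-byte UTF-8 character has its high bit set, so none of them can be mistaken for a space or a line ending.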
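As a quick illustration, here is a minimal sketch of how one might exercise the function on UTF-8 text that contains a multi-byte character. The wrapper `despace_cstring` is a hypothetical helper, not part of the original code; it only assumes the input is a NUL-terminated, writable buffer.

```c
#include <stddef.h>
#include <string.h>

/* Byte-by-byte routine, same logic as the function shown above. */
size_t despace(char *bytes, size_t howmany) {
  size_t pos = 0;
  for (size_t i = 0; i < howmany; i++) {
    char c = bytes[i];
    if (c == '\r' || c == '\n' || c == ' ') {
      continue; /* skip white space */
    }
    bytes[pos++] = c; /* keep everything else, compacting in place */
  }
  return pos;
}

/* Hypothetical helper (an assumption for this example): despace a
   NUL-terminated string in place, re-terminate it, return the new length. */
size_t despace_cstring(char *s) {
  size_t n = despace(s, strlen(s));
  s[n] = '\0';
  return n;
}
```

Given a buffer holding "café au\r\nlait" encoded as UTF-8, the two bytes that encode "é" are copied untouched and the buffer should end up holding "caféaulait".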

That’s simple and should be fast… I had a blast looking at how various compilers process this code. It ends up being a handful of instructions per processed byte.

But we are processing bytes one by one while our processors have a 64-bit architecture. Can we process the data by units of 64-bit words?

There is a somewhat mysterious bit-twiddling expression that returns a nonzero value whenever your word contains a zero byte. I call it haszero:

#define haszero(v) (((v) - UINT64_C(0x0101010101010101)) & ~(v) & UINT64_C(0x8080808080808080))

All we need to know is that it works. With this tool, we can write a faster function…
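For readers who want a little more than "it works", here is a small sketch of the idea. Subtracting 1 from every byte wraps a zero byte around to 0xFF, which sets its high bit; the `& ~(v)` term then discards bytes whose high bit was already set, and the final mask keeps only the high bits. To look for an arbitrary character instead of zero, one can XOR the word with that character copied into all eight bytes, so that matching bytes become zero. The helper name `hasbyte` is my own, chosen for illustration.

```c
#include <stdint.h>

/* Nonzero if and only if at least one of the eight bytes of v is zero. */
#define haszero(v) \
  (((v) - UINT64_C(0x0101010101010101)) & ~(v) & UINT64_C(0x8080808080808080))

/* Hypothetical helper: nonzero if any byte of word equals c. */
static inline uint64_t hasbyte(uint64_t word, unsigned char c) {
  /* ~0 / 255 is 0x0101010101010101; multiplying by c copies c into every byte. */
  uint64_t broadcast = ~UINT64_C(0) / 255 * (uint64_t)c;
  /* A byte of word equal to c turns into a zero byte after the XOR. */
  return haszero(word ^ broadcast);
}
```

For example, `hasbyte(word, ' ')` should be nonzero exactly when one of the eight bytes of `word` is a space.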

uint64_t mask1 = ~UINT64_C(0) / 255 * (uint64_t)('\r');
uint64_t mask2 = ~UINT64_C(0) / 255 * (uint64_t)('\n');
uint64_t mask3 = ~UINT64_C(0) / 255 * (uint64_t)(' ');
for (; i + 7 < howmany; i += 8) {
  memcpy(&word, bytes + i, sizeof(word));
  uint64_t xor1 = word ^ mask1;
  uint64_t xor2 = word ^ mask2;
  uint64_t xor3 = word ^ mask3;
  if (haszero(xor1) ^ haszero(xor2) ^ haszero(xor3)) {
    // check each of the eight bytes by hand?
  } else {
    memmove(bytes + pos, bytes + i, sizeof(word));
    pos += 8;
  }
}

It is going to be faster as long as most blocks of eight characters do not contain any white space. In that case, we are basically copying 64-bit words one after the other, along with a moderately expensive check that our superscalar processors can do quickly.
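The snippet above leaves out the declarations, the byte-by-byte fallback and the handling of the last few bytes. A minimal sketch of a complete function might look as follows. The name `despace64` and the way the slow path and the tail are written are assumptions made for this example, and the three checks are combined with a bitwise OR, which states the intent ("any of the three characters is present") directly.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define haszero(v) \
  (((v) - UINT64_C(0x0101010101010101)) & ~(v) & UINT64_C(0x8080808080808080))

size_t despace64(char *bytes, size_t howmany) {
  size_t i = 0, pos = 0;
  uint64_t word;
  /* Each mask holds one white-space character copied into all eight bytes. */
  const uint64_t mask1 = ~UINT64_C(0) / 255 * (uint64_t)('\r');
  const uint64_t mask2 = ~UINT64_C(0) / 255 * (uint64_t)('\n');
  const uint64_t mask3 = ~UINT64_C(0) / 255 * (uint64_t)(' ');
  for (; i + 7 < howmany; i += 8) {
    memcpy(&word, bytes + i, sizeof(word));
    if (haszero(word ^ mask1) | haszero(word ^ mask2) | haszero(word ^ mask3)) {
      /* Slow path: the block contains white space, check each byte by hand. */
      for (size_t k = 0; k < 8; k++) {
        char c = bytes[i + k];
        if (c == '\r' || c == '\n' || c == ' ') {
          continue;
        }
        bytes[pos++] = c;
      }
    } else {
      /* Fast path: no white space, move the whole block (regions may overlap). */
      memmove(bytes + pos, bytes + i, sizeof(word));
      pos += 8;
    }
  }
  /* Tail: fewer than eight bytes remain. */
  for (; i < howmany; i++) {
    char c = bytes[i];
    if (c == '\r' || c == '\n' || c == ' ') {
      continue;
    }
    bytes[pos++] = c;
  }
  return pos;
}
```

Because `pos` never gets ahead of `i`, the writes never clobber bytes that have not been read yet, which is what makes the in-place compaction safe.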

Can we do better? Sure! Ever since the Pentium 4 (released in late 2000), we have had 128-bit (SIMD) instructions.

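Let us solve the same problem with these nifty 128-bit SSE instructions, using the (ugly?) Intel intrinsics… Note that the byte shuffle used below (_mm_shuffle_epi8) requires SSSE3 and the population count (_mm_popcnt_u32) requires POPCNT support.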

__m128i spaces = _mm_set1_epi8(' ');
__m128i newline = _mm_set1_epi8('\n');
__m128i carriage = _mm_set1_epi8('\r');
for (; i + 15 < howmany; i += 16) {
  __m128i x = _mm_loadu_si128((const __m128i *)(bytes + i));
  __m128i xspaces = _mm_cmpeq_epi8(x, spaces);
  __m128i xnewline = _mm_cmpeq_epi8(x, newline);
  __m128i xcarriage = _mm_cmpeq_epi8(x, carriage);
  __m128i anywhite = _mm_or_si128(_mm_or_si128(xspaces, xnewline), xcarriage);
  int mask16 = _mm_movemask_epi8(anywhite); // contains 16 bits, 1 = is white
  if (mask16 == 0) { // no match!
    _mm_storeu_si128((__m128i *)(bytes + pos), x); // just recopy
    pos += 16;
  } else { // we need to permute the bytes
    x = _mm_shuffle_epi8(x, _mm_loadu_si128((const __m128i *)despace_mask16 + mask16));
    _mm_storeu_si128((__m128i *)(bytes + pos), x);
    pos += 16 - _mm_popcnt_u32(mask16); // popcount!
  }
}

The code is fairly straightforward if you are familiar with SIMD instructions on Intel processors. Here despace_mask16 is a precomputed table of shuffle masks, one for each possible value of mask16. I have made no effort to optimize it… so it is possible, even likely, that we could make it run faster. Unrolling the loop is a likely candidate for optimization.
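The lookup table is not shown in the snippet, so here is a sketch of how such a table could be built. The layout is an assumption on my part: one 16-byte shuffle control per 16-bit mask, listing the indexes of the bytes to keep in order, padded with 0x80 (which tells the byte shuffle to write a zero). The function names are hypothetical, and `shuffle16` is only a portable stand-in that mimics what `_mm_shuffle_epi8` does on one block, so the table can be checked without SIMD hardware.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed layout: 65536 entries of 16 bytes each (1 MiB in total). */
static uint8_t despace_mask16[65536][16];

/* Entry m lists, in increasing order, the indexes j whose bit in m is 0
   (bytes that are NOT white space), then pads the rest with 0x80. */
void build_despace_mask16(void) {
  for (uint32_t m = 0; m < 65536; m++) {
    size_t out = 0;
    for (uint8_t j = 0; j < 16; j++) {
      if (((m >> j) & 1) == 0) {
        despace_mask16[m][out++] = j;
      }
    }
    while (out < 16) {
      despace_mask16[m][out++] = 0x80;
    }
  }
}

/* Portable stand-in for the byte shuffle on one 16-byte block:
   out[k] = in[control[k] & 15], or 0 when the high bit of control[k] is set. */
void shuffle16(const uint8_t in[16], const uint8_t control[16], uint8_t out[16]) {
  for (size_t k = 0; k < 16; k++) {
    out[k] = (control[k] & 0x80) ? 0 : in[control[k] & 0x0F];
  }
}
```

With this layout, the kept bytes end up packed at the front of the block and the trailing bytes are don't-care values, which is fine because `pos` only advances by the number of kept bytes. A 1 MiB table does not fit in the first-level cache, so a smaller table indexed by half of the mask at a time might be worth trying.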

Let us see how fast it runs as is!

I designed a benchmark using a recent (Skylake) Intel processor over text entries where only a few characters are white space.

regular code            5.85 cycles / byte
using 64-bit words      2.56 cycles / byte
SIMD (128 bits) code    0.80 cycles / byte
memcpy                  0.08 cycles / byte

So the vectorized code is over seven times faster. That’s pretty good. I am using 128-bit registers, so I load and save blocks of 16 bytes. It would be foolish to expect to go 16 times faster, but I was hoping to be 8 times faster… being 7 times faster is close enough.

Yet pruning a few spaces is 10 times slower than copying the data with memcpy. So maybe we can go even faster. How fast could we be?

One hint: Our Intel processors can actually process 256-bit registers (with AVX/AVX2 instructions), so it is possible I could go twice as fast. Sadly, 256-bit SIMD instructions on x64 processors work on two independent 128-bit lanes, which makes algorithmic design more painful.

My approach using 64-bit words is a bit disappointing, as it is only twice as fast… but it has the benefit of being entirely portable… and I am sure a dedicated programmer could make it even faster.

My C code is available.

