How quickly can you remove spaces from a string?

Sometimes programmers want to prune out characters from a string of characters. For example, maybe you want to remove all line-ending characters from a piece of text.

Let me consider the problem where I want to remove all spaces (' ') and line-ending characters (the linefeed '\n' and the carriage return '\r').

How would you do it efficiently?

size_t despace(char *bytes, size_t howmany) {
  size_t pos = 0;
  for (size_t i = 0; i < howmany; i++) {
    char c = bytes[i];
    if (c == '\r' || c == '\n' || c == ' ') {
      continue;
    }
    bytes[pos++] = c;
  }
  return pos;
}

This code will work on all UTF-8 encoded strings… which is the bulk of the strings found on the Internet if you consider that UTF-8 is a superset of ASCII.
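Why is the byte-by-byte loop safe on UTF-8? Every byte of a multi-byte UTF-8 sequence has its high bit set (it is 0x80 or above), so the three ASCII values we drop can never show up inside an encoded character. Here is a minimal sketch of that point. It repeats the function above so that it compiles on its own; the helper name `despace_utf8_example` and the sample string are mine, chosen only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

// Same scalar routine as above, repeated so this snippet compiles on its own.
size_t despace(char *bytes, size_t howmany) {
  size_t pos = 0;
  for (size_t i = 0; i < howmany; i++) {
    char c = bytes[i];
    if (c == '\r' || c == '\n' || c == ' ') {
      continue;
    }
    bytes[pos++] = c;
  }
  return pos;
}

// Hypothetical check: the letter e-acute is encoded as the two bytes
// 0xC3 0xA9. Both are >= 0x80, so neither can be mistaken for ' ', '\n'
// or '\r', and the encoded character survives intact.
void despace_utf8_example(void) {
  char text[] = "caf\xC3\xA9 au\r\n lait";
  size_t n = despace(text, strlen(text));
  assert(n == 11); // 15 input bytes minus 2 spaces, '\r' and '\n'
  assert(memcmp(text, "caf\xC3\xA9" "aulait", n) == 0);
}
```

Note that the function returns the new length and does not write a terminating null byte; the caller has to add one if it wants a C string back.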

That’s simple and should be fast… I had a blast looking at how various compilers process this code. It ends up being a handful of instructions per processed byte.

But we are processing bytes one by one while our processors have a 64-bit architecture. Can we process the data in units of 64-bit words?

There is a somewhat mysterious bit-twiddling expression that evaluates to a non-zero value (that is, true in C) whenever your word contains a zero byte:

(((v)-UINT64_C(0x0101010101010101))&~(v)&UINT64_C(0x8080808080808080))

All we need to know is that it works. With this tool, we can write a faster function…
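If you would rather check than take it on faith, here is a small sketch. It wraps the expression in a `haszero` macro and pairs it with a second trick: dividing the all-ones word by 255 gives 0x0101010101010101, so multiplying that by a byte value repeats the value in all eight bytes. XORing a word with such a repeated value turns every matching byte into zero, which `haszero` can then detect. The helper names `broadcast`, `contains_byte` and `haszero_example` are mine and purely illustrative.

```c
#include <assert.h>
#include <stdint.h>

#define haszero(v) \
  (((v) - UINT64_C(0x0101010101010101)) & ~(v) & UINT64_C(0x8080808080808080))

// Hypothetical helper: copy one byte value into all eight bytes of a word.
// ~0/255 is 0x0101010101010101, so multiplying by c repeats c in every byte.
static uint64_t broadcast(unsigned char c) {
  return ~UINT64_C(0) / 255 * (uint64_t)c;
}

// Hypothetical helper: 1 when at least one byte of word equals c.
// XOR with the broadcast value turns every matching byte into zero.
static int contains_byte(uint64_t word, unsigned char c) {
  return haszero(word ^ broadcast(c)) != 0;
}

void haszero_example(void) {
  assert(broadcast(' ') == UINT64_C(0x2020202020202020));
  assert(haszero(UINT64_C(0x1122334455667788)) == 0); // no zero byte
  assert(haszero(UINT64_C(0x1122330055667788)) != 0); // one zero byte
  assert(contains_byte(UINT64_C(0x6162632064656667), ' '));  // has 0x20
  assert(!contains_byte(UINT64_C(0x6162636465666768), ' ')); // no 0x20
}
```

The result of `haszero` is a mask, not 0 or 1, so it should only be tested against zero.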

uint64_t mask1 = ~UINT64_C(0) / 255 * (uint64_t)('\r'); // '\r' in every byte
uint64_t mask2 = ~UINT64_C(0) / 255 * (uint64_t)('\n'); // '\n' in every byte
uint64_t mask3 = ~UINT64_C(0) / 255 * (uint64_t)(' ');  // ' ' in every byte
for (; i + 7 < howmany; i += 8) {
  memcpy(&word, bytes + i, sizeof(word));
  uint64_t xor1 = word ^ mask1; // a byte is zero where it matched '\r'
  uint64_t xor2 = word ^ mask2;
  uint64_t xor3 = word ^ mask3;
  if (haszero(xor1) | haszero(xor2) | haszero(xor3)) {
    // check each of the eight bytes by hand?
  } else {
    memmove(bytes + pos, bytes + i, sizeof(word));
    pos += 8;
  }
}
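The fragment above leaves two things open: what to do with a block that does contain white space, and what to do with the last few bytes when the length is not a multiple of eight. Here is one way it could be completed, as a sketch and not necessarily what my benchmarked code does. I assume the simplest possible answer to both questions, namely falling back on the byte-by-byte loop. The function name `despace64` and the example helper are assumptions made for this snippet.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define haszero(v) \
  (((v) - UINT64_C(0x0101010101010101)) & ~(v) & UINT64_C(0x8080808080808080))

size_t despace64(char *bytes, size_t howmany) {
  const uint64_t mask1 = ~UINT64_C(0) / 255 * (uint64_t)('\r');
  const uint64_t mask2 = ~UINT64_C(0) / 255 * (uint64_t)('\n');
  const uint64_t mask3 = ~UINT64_C(0) / 255 * (uint64_t)(' ');
  size_t i = 0, pos = 0;
  uint64_t word;
  for (; i + 7 < howmany; i += 8) {
    memcpy(&word, bytes + i, sizeof(word));
    uint64_t xor1 = word ^ mask1;
    uint64_t xor2 = word ^ mask2;
    uint64_t xor3 = word ^ mask3;
    if (haszero(xor1) | haszero(xor2) | haszero(xor3)) {
      // Assumed fallback: filter the eight bytes one by one.
      for (size_t k = 0; k < 8; k++) {
        char c = bytes[i + k];
        if (c == '\r' || c == '\n' || c == ' ') {
          continue;
        }
        bytes[pos++] = c;
      }
    } else {
      // No white space in this block: copy all eight bytes at once.
      memmove(bytes + pos, bytes + i, sizeof(word));
      pos += 8;
    }
  }
  // Assumed tail handling: the remaining (at most seven) bytes go one by one.
  for (; i < howmany; i++) {
    char c = bytes[i];
    if (c == '\r' || c == '\n' || c == ' ') {
      continue;
    }
    bytes[pos++] = c;
  }
  return pos;
}

void despace64_example(void) {
  char text[] = "hello world, this is\r\n a test string!";
  const char *expected = "helloworld,thisisateststring!";
  size_t n = despace64(text, strlen(text));
  assert(n == strlen(expected));
  assert(memcmp(text, expected, n) == 0);
}
```

Since `pos` never gets ahead of `i`, the writes always land on bytes that have already been read, which is what makes it safe to work in place.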

It is going to be faster as long as most blocks of eight characters do not contain any white space. When this occurs, we are basically copying 64-bit words one after the other, along with a moderately expensive check that our superscalar processors can do quickly.

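Can we do better? Sure! Ever since the Pentium 4 (released in late 2000), we have had 128-bit (SIMD) instructions.

Let us solve the same problem with these nifty 128-bit SSE instructions, using the (ugly?) Intel intrinsics. Note that the byte shuffle (_mm_shuffle_epi8) needs SSSE3 and the population count (_mm_popcnt_u32) needs the POPCNT extension, both of which arrived some years after the Pentium 4…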

__m128i spaces = _mm_set1_epi8(' ');
__m128i newline = _mm_set1_epi8('\n');
__m128i carriage = _mm_set1_epi8('\r');
for (; i + 15 < howmany; i += 16) {
  __m128i x = _mm_loadu_si128((const __m128i *)(bytes + i));
  __m128i xspaces = _mm_cmpeq_epi8(x, spaces);
  __m128i xnewline = _mm_cmpeq_epi8(x, newline);
  __m128i xcarriage = _mm_cmpeq_epi8(x, carriage);
  __m128i anywhite = _mm_or_si128(_mm_or_si128(xspaces, xnewline), xcarriage);
  int mask16 = _mm_movemask_epi8(anywhite); // contains 16 bits, 1 = is white
  if (mask16 == 0) { // no match!
    _mm_storeu_si128((__m128i *)(bytes + pos), x); // just recopy
    pos += 16;
  } else { // we need to permute the bytes
    x = _mm_shuffle_epi8(x, _mm_loadu_si128((const __m128i *)despace_mask16 + mask16));
    _mm_storeu_si128((__m128i *)(bytes + pos), x);
    pos += 16 - _mm_popcnt_u32(mask16); // popcount!
  }
}
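The snippet relies on a lookup table, despace_mask16, that is not shown here. For each of the 65536 possible values of mask16, it has to provide 16 shuffle indexes that move the bytes we keep to the front. Here is a sketch of how such a table could be generated, in portable C so that it runs anywhere. The name `despace_table`, its two-dimensional layout, the 0x80 filler and the scalar `shuffle_model` function (which mimics what the byte shuffle does with one table entry) are all assumptions made for illustration; the table in my actual code may be built differently.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

// Hypothetical layout: one 16-byte shuffle control per 16-bit mask (1 MiB).
static uint8_t despace_table[65536][16];

void build_despace_table(void) {
  for (uint32_t mask = 0; mask < 65536; mask++) {
    int out = 0;
    for (int j = 0; j < 16; j++) {
      if (((mask >> j) & 1) == 0) { // bit j clear: byte j is kept
        despace_table[mask][out++] = (uint8_t)j;
      }
    }
    // Filler for the unused slots (my choice): an index with the high bit
    // set makes the shuffle write a zero. Those bytes get overwritten by
    // the next store anyway.
    while (out < 16) {
      despace_table[mask][out++] = 0x80;
    }
  }
}

// Scalar model of the byte shuffle: output byte j is src[control[j] & 15],
// or zero when the high bit of control[j] is set.
void shuffle_model(const uint8_t src[16], const uint8_t control[16],
                   uint8_t dst[16]) {
  for (int j = 0; j < 16; j++) {
    dst[j] = (control[j] & 0x80) ? 0 : src[control[j] & 0x0F];
  }
}

void despace_table_example(void) {
  build_despace_table();
  uint8_t src[16];
  memcpy(src, "ab cd\nef gh\rijkl", 16);
  uint32_t mask = 0; // same bit order as the movemask: bit j <-> byte j
  for (int j = 0; j < 16; j++) {
    if (src[j] == ' ' || src[j] == '\n' || src[j] == '\r') {
      mask |= UINT32_C(1) << j;
    }
  }
  uint8_t dst[16];
  shuffle_model(src, despace_table[mask], dst);
  assert(memcmp(dst, "abcdefghijkl", 12) == 0); // 16 bytes minus 4 white
}
```

A table like this takes one megabyte, which is larger than the L1 and L2 caches of many processors, so a smaller table indexed by eight bits at a time may be worth trying.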

The code is fairly straightforward if you are familiar with SIMD instructions on Intel processors. I have made no effort to optimize it… so it is possible, even likely, that we could make it run faster. Unrolling the loop is a likely candidate for optimization.

Let us see how fast it runs as is!

I designed a benchmark using a recent (Skylake) Intel processor over text entries where only a few characters are white space.

regular code            5.85 cycles/byte
using 64-bit words      2.56 cycles/byte
SIMD (128 bits) code    0.80 cycles/byte
memcpy                  0.08 cycles/byte

So the vectorized code is over seven times faster. That’s pretty good. I am using 128-bit registers, so I load and save blocks of 16 bytes. It would be foolish to expect to go 16 times faster, but I was hoping to be 8 times faster… being 7 times faster is close enough.

Yet pruning a few spaces is 10 times slower than copying the data with memcpy. So maybe we can go even faster. How fast could we be?

One hint: Our Intel processors can actually process 256-bit registers (with AVX/AVX2 instructions), so it is possible I could go twice as fast. Sadly, most 256-bit SIMD instructions on x64 processors operate on two independent 128-bit lanes, which makes algorithmic design more painful.

My approach using 64-bit words is a bit disappointing, as it is only twice as fast… but it has the benefit of being entirely portable… and I am sure a dedicated programmer could make it even faster.

My C code is available.
