I'm developing a bioinformatics tool and I'm trying to use SIMD to boost its speed.
Given two char arrays of length 16, I need to rapidly count the number of indices at which the strings match. For example, the two following strings, "TTTTTTTTTTTTTTTT" and "AAAAGGGGTTTTCCCC", match from 9th through 12th positions ("TTTT"), and therefore the output should be 4.
As shown in the following function foo (which works fine but slow), I packed each characters in seq1 and seq2 into __m128i variables s1 and s2, and used _mm_cmpeq_epi8 to compare every position simultaneously. Then, using popcnt128 (from Fast counting the number of set bits in __m128i register by Marat Dukhan) to add up the number of matching bits.
float foo(char* seq1, char* seq2) {
__m128i s1, s2, ceq;
int match;
s1 = _mm_load_si128((__m128i*)(seq1));
s2 = _mm_load_si128((__m128i*)(seq2));
ceq = _mm_cmpeq_epi8(s1, s2);
match = (popcnt128(ceq)/8);
return match;
}
Although popcnt128 by Marat Dukhan is a lot faster than naïvely adding up every bit in __m128i, __popcnt128() is the slowest bottleneck in the function, taking up about 80% of the computational speed. So, I would like to come up with an alternative to popcnt128.
I tried to interpret __m128i ceq
as a string and to use it as a key for a precomputed look-up table that maps a string to the total number of bits. If char array were hashable, I could do something like
union{__m128i ceq; char c_arr[16];}
match = table[c_arr] // table = unordered map
If I try to do something similar for strings (i.e. union{__m128i ceq; string s;};
), I get the following error message "::()’ is implicitly deleted because the default definition would be ill-formed". When I tried other things, I ran into segmentation faults.
Is there any way I can tell the compiler to read __m128i as string so I can directly use __m128i as a key for unordered_map? I don't see why it shouldn't work because string is a contiguous array of chars, which can be naturally represented by __m128i. But I couldn't get it to work and unable to find any solution online.
You're probably doing this for longer sequences, multiple SIMD vectors of data. In that case, you can accumulate counts in a vector that you only sum up at the end. It's a lot less efficient to popcount every vector separately.
See How to count character occurrences using SIMD - instead of _mm256_set1_epi8(c);
to search for a specific character, load from the other string. Do everything else the same, including
counts = _mm_sub_epi8(counts, _mm_cmpeq_epi8(s1, s2));
in the inner loop, and the loop unrolling. (A compare result is an integer 0 / -1, so subtracting it adds 0 or 1 to another vector.) This is at risk of overflow after 256 iterations, so do at most 255. That linked question uses AVX2, but the __m128i
versions of those intrinsics only require SSE2. (Of course, AVX2 would let you get twice as much work done per vector instruction.)
Horizontal sum the byte counters in the outer loop, using _mm_sad_epu8(v, _mm_setzero_si128());
and then accumulating into another vector of counts. Again, this is all in the code in the linked Q&A, so just copy/paste that and add a load from the other string into the inner loop, instead of using a broadcast constant.
Can counting byte matches between two strings be optimized using SIMD? shows basically the same thing for 128-bit vectors, including a version at the bottom that only does SAD hsums after an inner loop. It's written for two input pointers already, rather than char and string.
You don't need to count all the bits in your __m128i
; take advantage of the fact that all 8 bits in each byte are the same by extracting 1 bit per element to a scalar integer. (x86 SIMD can do that efficiently, unlike some other SIMD ISAs)
count = __builtin_popcnt(_mm_movemask_epi8(cmp_result));
Another possible option is psadbw
against 0 (hsum of bytes on the compare result), but that needs a final hsum step of qword halves, so that's going to be worse than HW popcnt. But if you can't compile with -mpopcnt
then it's worth considering if you need baseline x86-64 with just SSE2. (Also you need to negate before psadbw, or scale the sum down by 1/255...)
(Note that the psadbw strategy is basically what I described in the first section of the answer, but for only a single vector, not taking advantage of the ability to cheaply add multiple counts into one vector accumulator.)
If you really need the result as a float
, then that makes a psadbw
strategy less bad: you can keep the value in SIMD vectors the whole time, using _mm_cvtepi32_ps
to do packed conversion on the horizontal sum result (even cheaper than cvtsi2ss
int->float scalar conversion). _mm_cvtps_f32
is free; a scalar float is just the low element of an XMM register.
But seriously, do you really need an integer count as a float
now? Can't you at least wait until you have the sum across all vectors, or keep it integer?
-mpopcnt
is implied by gcc -msse4.2
, or -march=native
on anything less than 10 years old. Core 2 lacked hardware popcnt, but Nehalem had it for Intel.