I'm pondering at how to speed up bit testing in the following routine:
void histSubtractFromBits(uint64* cursor, uint16* hist){
//traverse each bit of the 256-bit-long bitstring by splitting up into 4 bitsets
std::bitset<64> a(*cursor);
std::bitset<64> b(*(cursor+1));
std::bitset<64> c(*(cursor+2));
std::bitset<64> d(*(cursor+3));
for(int bit = 0; bit < 64; bit++){
hist[bit] -= a.test(bit);
}
for(int bit = 0; bit < 64; bit++){
hist[bit+64] -= b.test(bit);
}
for(int bit = 0; bit < 64; bit++){
hist[bit+128] -= c.test(bit);
}
for(int bit = 0; bit < 64; bit++){
hist[bit+192] -= d.test(bit);
}
}
The actual gcc implementation does a range-check for the bit argument, then &-s with a bitmask. I could do it without the bitsets and with my own bit-shifting / masking, but I'm fairly certain that won't yield any significant speedup (tell me if I'm wrong and why).
I'm not really familiar with the x86-64 assembly, but I am aware of a certain bit test instruction, and I am aware that it's theoretically possible to do inline assembly with gcc.
1) Do you think it at all worthwhile to write an inline-assembly analogue for the above code?
2) If yes, then how would I go about doing it, i.e. could you show me some basic starter code / samples to point me in the right direction?
As far as I can tell, you basically iterate over each bit. As such, I'd imagine simply shifting and masking off the LSB every time should provide good performance. Something like:
uint64_t a = *cursor;
for(int bit = 0; a != 0; bit++, a >>= 1) {
hist[bit] -= (a & 1);
}
Alternatively, if you expect only very few bits to be set and are happy with gcc specific stuff, you could use __builtin_ffsll
uint64_t a = *cursor;
int next;
for(int bit = 0; (next = __builtin_ffsll(a)) != 0; ) {
bit += next;
hist[bit - 1] -= 1;
a >>= next;
}
The idea should be fine, but no warranty for the actual code :)
Update: code using vector extensions:
typedef short v8hi __attribute__ ((vector_size (16)));
static v8hi table[256];
void histSubtractFromBits(uint64_t* cursor, uint16_t* hist)
{
uint8_t* cursor_tmp = (uint8_t*)cursor;
v8hi* hist_tmp = (v8hi*)hist;
for(int i = 0; i < 32; i++, cursor_tmp++, hist_tmp++)
{
*hist_tmp -= table[*cursor_tmp];
}
}
void setup_table()
{
for(int i = 0; i < 256; i++)
{
for(int j = 0; j < 8; j++)
{
table[i][j] = (i >> j) & 1;
}
}
}
This will be compiled to SSE instructions if available, for example I get:
leaq 32(%rdi), %rdx
.p2align 4,,10
.p2align 3
.L2:
movzbl (%rdi), %eax
addq $1, %rdi
movdqa (%rsi), %xmm0
salq $4, %rax
psubw table(%rax), %xmm0
movdqa %xmm0, (%rsi)
addq $16, %rsi
cmpq %rdx, %rdi
jne .L2
Of course this approach relies on the table being in cache.