I'm writing a tool for operations on long strings of 6 different letters (e.g. >1000000 letters), so I'd like to encode each letter in less than eight bits (for 6 letters 3 bits is sufficient)
Here is my code:
Rcpp::RawVector pack(Rcpp::RawVector UNPACKED,
const unsigned short ALPH_SIZE) {
const unsigned int IN_LEN = UNPACKED.size();
Rcpp::RawVector ret((ALPH_SIZE * IN_LEN + BYTE_SIZE - 1) / BYTE_SIZE);
unsigned int out_byte = ZERO;
unsigned short bits_left = BYTE_SIZE;
for (int i = ZERO; i < IN_LEN; i++) {
if (bits_left >= ALPH_SIZE) {
ret[out_byte] |= (UNPACKED[i] << (bits_left - ALPH_SIZE));
bits_left -= ALPH_SIZE;
} else {
ret[out_byte] |= (UNPACKED[i] >> (ALPH_SIZE - bits_left));
bits_left = ALPH_SIZE - bits_left;
out_byte++;
ret[out_byte] |= (UNPACKED[i] << (BYTE_SIZE - bits_left));
bits_left = BYTE_SIZE - bits_left;
}
}
return ret;
}
I'm using Rcpp, which is an R interface for C++. RawVector
is in fact vector
of char
's.
This code works just perfectly - except it is too slow. I'm performing operations bit by bit while I could vectorize it somehow. And here is a question - is there any library or tool to do it? I'm not acknowledged with C++ tools.
Thanks in advance!
You need to unroll your loop, so optimizer could make something useful out of it. It will also get rid of your if
, which kills any chance for quick performance. Something like this:
int i = 0;
for(i = 0; i + 8 <= IN_LEN; i += 8) {
ret[out_byte ] = (UNPACKED[i] ) | (UNPACKED[i + 1] << 3) | (UNPACKED[i + 2] << 6);
ret[out_byte + 1] = (UNPACKED[i + 2] >> 2) | (UNPACKED[i + 3] << 1) | (UNPACKED[i + 4] << 4) | (UNPACKED[i + 5] << 7);
ret[out_byte + 2] = (UNPACKED[i + 5] >> 1) | (UNPACKED[i + 6] << 2) | (UNPACKED[i + 7] << 5);
out_byte += 3;
}
for (; i < IN_LEN; i++) {
if (bits_left >= ALPH_SIZE) {
ret[out_byte] |= (UNPACKED[i] << (bits_left - ALPH_SIZE));
bits_left -= ALPH_SIZE;
} else {
ret[out_byte] |= (UNPACKED[i] >> (ALPH_SIZE - bits_left));
bits_left = ALPH_SIZE - bits_left;
out_byte++;
ret[out_byte] |= (UNPACKED[i] << (BYTE_SIZE - bits_left));
bits_left = BYTE_SIZE - bits_left;
}
}
This will allow optimizer to vectorize whole thing (assuming it's smart enough). With your current implementation i doubt any current compiler can find out, that your code loops after 3 written bytes and abuse it.
EDIT: with sufficient constexpr / template magic you might be able to write some universal handler for body of the loop. Or just cover all small values (like write specialized template function for every bitcount from 1 to let's say 16). Packing values bitwise after 16 bits is overkill.