The purpose of the code is to subtract from each character of the string str a value from the key array. The non-vectorised version of the program corresponds to the final loop in both programs. How is this code:
void decode(const char* key, int m, char* str) {
    int i; int n = strlen(str);
    __m128i k = _mm_loadu_si128((const __m128i*) key);
    for (int i = 0; i + 16 < n; i+=m) {
        __m128i s = _mm_loadu_si128((__m128i*) (str + i));
        s = _mm_sub_epi8(s, k);
        _mm_storeu_si128((__m128i*) (str + i), s);
    }
    for(; i<n; i++) str[i] -= key[i%m];
}
different from this?
void decode(const char* key, int m, char* str) {
    int i, n = strlen(str);
    char keybuf[16] = { 0 };
    memcpy(keybuf, key, m);
    __m128i k = _mm_loadu_si128((__m128i*)keybuf);
    for (i=0; i+16 < n; i += m) {
        __m128i s = _mm_loadu_si128((__m128i*)(str+i));
        s = _mm_sub_epi8(s,k);
        _mm_storeu_si128((__m128i*)(str+i), s);
    }
    for (; i<n; i++) str[i] -= key[i % m];
}
Without the memory copy the same code does not work the same way. I'm compiling with gcc -msse2. Why is the memory copy necessary?
The difference is that in the second case you are only loading m characters into keybuf, and the remaining elements stay initialised to 0. Subtracting 0 leaves a byte unchanged, so these additional elements have no effect on str.
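In the first version, however, you most likely have non-zero elements at the end of the vector, since you blindly load all 16 bytes from key, regardless of the actual length of the key. Whatever lies beyond the first m bytes is then subtracted from the following 16 - m characters of str as well.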
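To make the effect of the tail bytes concrete, here is a minimal sketch of a single iteration of the vectorised loop. The helper name sub16 is hypothetical (it is not in the question's code), and it assumes both pointers refer to at least 16 accessible bytes:

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical helper: one iteration of the vectorised loop.
   Subtracts the 16 bytes at key16 from the 16 bytes at str.
   Assumes both buffers hold at least 16 accessible bytes. */
static void sub16(char *str, const char *key16)
{
    __m128i k = _mm_loadu_si128((const __m128i *)key16);
    __m128i s = _mm_loadu_si128((const __m128i *)str);
    _mm_storeu_si128((__m128i *)str, _mm_sub_epi8(s, k));
}
```

With a key buffer such as { 1, 2, 0, 0, ... } only the first two characters change. If the bytes after the key are non-zero, every one of the 16 characters is modified, which is what happens when key is loaded directly.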
To make the first version work correctly you would need to mask out the final 16 - m elements of k, forcing them to be zero, e.g.
const int8_t mask[32] = { -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
                           0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0 };
__m128i k = _mm_loadu_si128((const __m128i*) key); // load 16 elements
k = _mm_and_si128(k, _mm_loadu_si128((const __m128i*)&mask[16 - m])); // mask out final 16 - m elements
(Note: there is probably a more efficient way of doing the masking, but it's the best I could come up with at short notice. It's still going to be more efficient than the memcpy version, I would guess. See this question and its answers for some other methods.)
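Putting the masking into the whole function might look like the sketch below. The name decode_masked is made up for illustration, and the code rests on two assumptions that are not stated in the question: 1 <= m <= 16, and key points to at least 16 readable bytes (the unmasked 16-byte load touches all of them, even though only the first m are used). The loop bound is written as i + 16 <= n, which also keeps every 16-byte load inside the string.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>
#include <string.h>

/* Sketch only. Assumes 1 <= m <= 16 and that key points to at least
   16 readable bytes; bytes of key past the first m are masked to zero. */
void decode_masked(const char *key, int m, char *str)
{
    static const int8_t mask[32] = {
        -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0
    };
    int i;
    int n = (int)strlen(str);

    /* Load 16 bytes, then keep the first m and zero the remaining 16 - m. */
    __m128i k = _mm_loadu_si128((const __m128i *)key);
    k = _mm_and_si128(k, _mm_loadu_si128((const __m128i *)&mask[16 - m]));

    /* Each iteration decodes m characters; the zeroed lanes leave the
       other 16 - m loaded characters untouched for later iterations. */
    for (i = 0; i + 16 <= n; i += m) {
        __m128i s = _mm_loadu_si128((const __m128i *)(str + i));
        s = _mm_sub_epi8(s, k);
        _mm_storeu_si128((__m128i *)(str + i), s);
    }

    /* Scalar tail: i is a multiple of m here, so i % m stays in phase. */
    for (; i < n; i++)
        str[i] -= key[i % m];
}
```

If key may be shorter than 16 bytes in memory, the memcpy into a zeroed 16-byte buffer from the question is the safer choice, since it never reads past the end of key.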