c x86 sse simd sse2

Intel load intrinsic issue

The purpose of the code is to subtract from each character of the string str the corresponding value in the key array, where the key has length m and repeats. The non-vectorised version of the program corresponds to the final loop in both versions. How is this code:

void decode(const char* key, int m, char* str) {
  int i; int n = strlen(str);
  __m128i k = _mm_loadu_si128((const __m128i*) key);
  for (i = 0; i + 16 < n; i += m) {
    __m128i s = _mm_loadu_si128((__m128i*) (str + i));
    s = _mm_sub_epi8(s, k);
    _mm_storeu_si128((__m128i*) (str + i), s);
  }
  for(; i<n; i++) str[i] -= key[i%m];
}

different from this?

void decode(const char* key, int m, char* str) {
  int i, n = strlen(str);
  char keybuf[16] = { 0 };
  memcpy(keybuf, key, m);
  __m128i k = _mm_loadu_si128((__m128i*)keybuf);
  for (i=0; i+16 < n; i += m) {
    __m128i s = _mm_loadu_si128((__m128i*)(str+i));
    s = _mm_sub_epi8(s,k);
    _mm_storeu_si128((__m128i*)(str+i), s);
  }
  for (; i<n; i++) str[i] -= key[i % m];
}

Without the memory copy the same code does not work the same way. I'm compiling with gcc -msse2. Why is the memory copy necessary?

Solution

The difference is that in the second case you copy only m characters into keybuf, and the remaining 16 - m elements stay initialised to 0. Subtracting 0 leaves a byte unchanged, so these additional elements have no effect on str.

In the first version however you most likely have non-zero elements at the end of the vector, since you blindly load all 16 elements from key, regardless of the actual length of the key.
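To see the difference concretely, here is a small sketch that stores the vector produced by each approach so the individual lanes can be inspected. The helper names load_key_raw and load_key_padded are made up for illustration, and the code assumes that at least 16 bytes are readable at key and that 0 <= m <= 16.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <string.h>

/* Hypothetical helper: load 16 bytes straight from key, as the first
   version does. Assumes 16 bytes are readable at key. */
void load_key_raw(const char *key, unsigned char out[16]) {
    __m128i k = _mm_loadu_si128((const __m128i *)key);
    _mm_storeu_si128((__m128i *)out, k);
}

/* Hypothetical helper: copy only m bytes into a zeroed buffer first, as
   the second version does. Assumes 0 <= m <= 16. */
void load_key_padded(const char *key, int m, unsigned char out[16]) {
    char keybuf[16] = { 0 };
    memcpy(keybuf, key, (size_t)m);
    __m128i k = _mm_loadu_si128((const __m128i *)keybuf);
    _mm_storeu_si128((__m128i *)out, k);
}
```

If the buffer holds "abc" followed by other non-zero bytes and m is 3, lane 3 of the raw load contains whatever follows the key, whereas lane 3 of the padded load is 0.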

To make the first version work correctly you would need to mask out the final 16 - m elements of k, forcing them to be zero (this assumes m <= 16 and that it is safe to read 16 bytes from key), e.g.

const int8_t mask[32] = { -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
                          0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 };
__m128i k = _mm_loadu_si128((const __m128i*) key); // load 16 elements
// mask out final 16 - m elements
k = _mm_and_si128(k, _mm_loadu_si128((const __m128i*)&mask[16 - m]));
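For completeness, this is roughly how the masking might slot into the whole function. It is only a sketch: the names decode_masked and decode_scalar are made up here, and it assumes 1 <= m <= 16, that at least 16 bytes are readable at key, and that str contains no zero bytes before decoding. The scalar version is included only as a reference to compare against.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>
#include <string.h>

/* Sixteen all-ones bytes followed by sixteen zero bytes. Loading 16 bytes
   from offset 16 - m gives m bytes of 0xFF followed by 16 - m bytes of 0. */
static const int8_t mask[32] = {
    -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0 };

/* Assumes 1 <= m <= 16 and that 16 bytes are readable at key. */
void decode_masked(const char *key, int m, char *str) {
    int i;
    int n = (int)strlen(str);
    __m128i k = _mm_loadu_si128((const __m128i *)key);  /* load 16 bytes */
    /* zero the final 16 - m lanes so they do not touch str */
    k = _mm_and_si128(k, _mm_loadu_si128((const __m128i *)&mask[16 - m]));
    for (i = 0; i + 16 < n; i += m) {
        __m128i s = _mm_loadu_si128((const __m128i *)(str + i));
        s = _mm_sub_epi8(s, k);  /* only the first m lanes change */
        _mm_storeu_si128((__m128i *)(str + i), s);
    }
    for (; i < n; i++) str[i] -= key[i % m];  /* scalar tail */
}

/* Scalar reference: the non-vectorised loop applied to the whole string. */
void decode_scalar(const char *key, int m, char *str) {
    int n = (int)strlen(str);
    for (int i = 0; i < n; i++) str[i] -= key[i % m];
}
```

Running both functions on copies of the same input, with non-zero junk after the first m bytes of key, should give identical output, which is a quick way to check that the masking is doing its job.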

(Note: there is probably a more efficient way of doing the masking, but it's the best I could come up with at short notice. It's still going to be more efficient than the memcpy version, I would guess. See this question and its answers for some other methods.)
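As one example of an alternative, the mask can be built without a lookup table by comparing each lane index against m. This is only a sketch under the assumption that 0 <= m <= 16; the helper name keep_first_m_bytes is made up, and I have not measured whether it is actually faster than the table load.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical helper: zero every byte of k whose lane index is >= m.
   Assumes 0 <= m <= 16. */
__m128i keep_first_m_bytes(__m128i k, int m) {
    /* lane indices 0..15, lowest lane first */
    const __m128i lane = _mm_setr_epi8(0, 1, 2, 3, 4, 5, 6, 7,
                                       8, 9, 10, 11, 12, 13, 14, 15);
    /* signed compare: a lane becomes 0xFF where m > lane index, else 0 */
    const __m128i keep = _mm_cmpgt_epi8(_mm_set1_epi8((char)m), lane);
    return _mm_and_si128(k, keep);
}
```

It would be used in place of the table-based step, e.g. k = keep_first_m_bytes(_mm_loadu_si128((const __m128i*) key), m);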