c++, clang, vectorization, auto-vectorization

Auto-vectorization with a modulo index?


I'm trying to make clang++ auto-vectorize a simple bit scrambling loop that does something like this:

for(int i = 0; i < sz; ++i) {
   dst[i] = src[i] ^ key[i];
}

If dst, src and key are the same length, the compiler has no problem vectorizing this loop, but what I really want to do is this:

for(int i = 0; i < sz; ++i) {
   dst[i] = src[i] ^ key[i % 64];
}

I don't need key to be as long as the data, but when I add the % 64 the vectorizer gives up and I'm left with a scalar loop. This happens even with % 8, which is the size of the SIMD registers. The next thing I tried was this:

char d = 0x80;
for(int i = 0; i < sz; ++i) {
   dst[i] = src[i] ^ d;
   ++d;
}

but the vectorizer didn't like this either.
Doing this, however:

for(int i = 0; i < sz; ++i) {
   dst[i] = src[i] ^ 0x80;
   ++d;
}

did get vectorized fine, but a key of just one byte is shorter than I hoped for.

Is there a way to do something like this in a way that pleases the vectorizer?


Solution

  • I can reproduce this with Apple's (Xcode) clang. Processing the data in blocks of 64, so that the key index is a plain inner-loop counter instead of i % 64, appears to satisfy the vectorizer:

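    int i = 0; /* current index into src/dst */

    int szd = sz / 64; /* number of full 64-byte blocks */
    int szm = sz % 64; /* leftover bytes after the last full block */
    for (int j = 0; j < szd; j++)
    {
        /* k is a plain counter over the key, so no modulo is needed */
        for (int k = 0; k < 64; i++, k++)
            dst[i] = src[i] ^ key[k];
    }

    /* tail: the remaining szm bytes use the start of the key */
    for (int k = 0; k < szm; i++, k++)
        dst[i] = src[i] ^ key[k];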