Tags: c++, x86, vectorization, avx2, vector-class-library

How to gather arbitrary indexes in VCL with AVX2 enabled


I want to vectorize the following code using gather instructions in VCL. Some operations should be performed on the elements of vSource at the indexes given by another vector, vInd:

vector<int> vSource;
vector<int> vInd;
for (size_t i = 0; i < vInd.size(); i++) {
    vSource[ vInd[i] ]; // some work on the element at index vInd[i]
}

vInd contains completely random indexes, so I cannot shuffle them or use another cheap workaround. Desired output example:

vector<int> vSource = {1,2,3,4,5,6,7,8,9,10,11,12,13};
vector<int> vInd = {2,1,5,3,10,5,8,2,10,2,5,3};
3   2   6   4   11  6   9   3   (first 8 results shown)

I can vectorize my code using AVX2 intrinsics:

void intrinGather(vector<int>& vSource, vector<int>& vInd) {
    __m256i ind = _mm256_loadu_si256((const __m256i*)&vInd[0]); // load 8 indexes
    __m256i vec = _mm256_i32gather_epi32(&vSource[0], ind, 4);  // gather with 4-byte scale
}
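
For completeness, a sketch of writing the gathered register back with an unaligned store; the output vector vOut and the name intrinGatherStore are illustrative assumptions, not part of the original code:

void intrinGatherStore(vector<int>& vSource, vector<int>& vInd, vector<int>& vOut) {
    // vInd and vOut are assumed to hold at least 8 elements each
    __m256i ind = _mm256_loadu_si256((const __m256i*)&vInd[0]);
    __m256i vec = _mm256_i32gather_epi32(&vSource[0], ind, 4); // 4-byte element scale
    _mm256_storeu_si256((__m256i*)&vOut[0], vec);              // store the 8 gathered values
}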

However, the VCL version compiles only if I use compile-time constant indexes. How can I pass arbitrary run-time indexes as a parameter to VCL?

void VCLGather(vector<int> & vSource, vector<int> ind) {
    Vec8i vec;
    vec=gather8i<2,1,5,3,10,5,8,2>(&vSource[0]); //compiles
    //vec=gather8i<ind[0],ind[3],ind[2],ind[10],ind[6],ind[8],ind[7],ind[1]>(&vSource[0]); //doesn't compile
}

I'm perfectly fine with the intrinGather function, but I want to keep the code in the same VCL style and retain features such as multi-architecture code. Is that possible?


Solution

  • The VCL template function lookup<n>(index, table) is indeed intended for this purpose (see the sketch at the end of this answer).

    VCL will search for the optimal implementation of your function. It will use a permute instruction rather than a gather instruction if n is not too big, because permute instructions are much faster than gather instructions. The n parameter is there to enable this optimization.

    The lookup<n> templates limit each index to the interval 0 ≤ i < n for security reasons. If you don't want this safety check, you may set n = INT_MAX. I will change the code to make sure the interval check is optimized away in this case.
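
    A minimal sketch of how VCLGather could use lookup<n>, assuming VCL's vectorclass.h, Vec8i::load for the run-time indexes, and n = INT_MAX as described above (returning the result is added for illustration):

    #include <climits>
    #include "vectorclass.h"

    Vec8i VCLGather(vector<int>& vSource, vector<int>& vInd) {
        Vec8i ind;
        ind.load(&vInd[0]);  // load 8 run-time indexes
        // n limits every index to 0 <= i < n; INT_MAX disables the check,
        // while a small compile-time n lets VCL use a permute instead of a gather
        return lookup<INT_MAX>(ind, &vSource[0]);
    }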