I want to vectorize following code using gather instructions in VCL. Some operations should be performed on the indexes of vSource defined by other vector VInd:
vector<int> vSource;
vector<int> vInd;
for (auto i = 0; i < vSource.size();i++) {
vSource[ vInd[i] ]; //some work
}
vInd contains completely random indexes, so I cannot shuffle them or do other cheap workaround. Desired output example:
vector<int> vSource = {1,2,3,4,5,6,7,8,9,10,11,12,13};
vector<int> vInd = {2,1,5,3,10,5,8,2,10,2,5,3};
3 2 6 4 11 6 9 3
I can vectorize my code using AVX2.
void intrinGather(vector<int> & vSource, vector <int> & vInd) {
__m256i ind = _mm256_loadu_si256((__m256i*) & vInd[0]);
__m256i vec = _mm256_i32gather_epi32(&vSource[0], ind, 4);
}
However VCL version compiles only if I use compile-time indexes. How to pass arbitrary indexes as a parameter to VCL?
void VCLGather(vector<int> & vSource, vector<int> ind) {
Vec8i vec;
vec=gather8i<2,1,5,3,10,5,8,2>(&vSource[0]); //compiles
//vec=gather8i<ind[0],ind[3],ind[2],ind[10],ind[6],ind[8],ind[7],ind[1]>(&vSource[0]); //doesn't compile
}
I'm perfectly fine with intrinGather function, but want to keep code in the same VCL-using style and features like multi-architecture code. Is it possible?
The VCL template function lookup<n>(index, table) is indeed intended for this purpose.
VCL will search for the optimal implementation of your function. It will use a permute instruction rather than a gather instruction if n is not too big, because permute instructions are much faster than gather instructions. The n parameter is added in order to enable this optimization
The lookup<n> templates are limiting each index to the interval 0 ≤ i < n for security reasons. If you don't want this security then you may set n = INT_MAX. I will change the code to make sure the interval check is optimized away in this case.