I have a Vectorization optimization problem.
I have an array of structs, pDst, whose elements have 3 fields named 'red', 'green' and 'blue'.
The field type may be 'char', 'short' or 'float'. This layout is given and cannot be altered.
There is another array, pSrc, which represents an RGB image: an array of 3 pointers, each of which points to one plane of the image.
Each plane is allocated independently as an IPP planar image (for example with 'ippiMalloc_32f_C1'):
http://software.intel.com/sites/products/documentation/hpc/ipp/ippi/ippi_ch3/functn_Malloc.html.
I would like to copy the planes into pDst as in the following code:
for(int y = 0; y < imageHeight; ++y)
{
    for(int x = 0; x < imageWidth; ++x)
    {
        pDst[x + y * pDstRowStep].red = pSrc[0][x + y * pSrcRowStep];
        pDst[x + y * pDstRowStep].green = pSrc[1][x + y * pSrcRowStep];
        pDst[x + y * pDstRowStep].blue = pSrc[2][x + y * pSrcRowStep];
    }
}
Yet, in this form the compiler can't vectorize the code.
At first it says:
"loop was not vectorized: existence of vector dependence.".
When I add #pragma ivdep to help the compiler (since there is no dependence), I get a different message:
"loop was not vectorized: dereference too complex.".
Does anyone have an idea how to enable vectorization?
I use Intel Compiler 13.0.
Thanks.
If I edit the code as follows:
Ipp32f *redChannel = pSrc[0];
Ipp32f *greenChannel = pSrc[1];
Ipp32f *blueChannel = pSrc[2];
for(int y = 0; y < imageHeight; ++y)
{
    #pragma ivdep
    for(int x = 0; x < imageWidth; ++x)
    {
        pDst[x + y * pDstRowStep].red = redChannel[x + y * pSrcRowStep];
        pDst[x + y * pDstRowStep].green = greenChannel[x + y * pSrcRowStep];
        pDst[x + y * pDstRowStep].blue = blueChannel[x + y * pSrcRowStep];
    }
}
For output types of 'char' and 'short' I get vectorization.
Yet for 'float' I don't.
Instead I get the following message:
"loop was not vectorized: vectorization possible but seems inefficient.".
How could that be?
In the following code, #pragma ivdep does make the compiler ignore the assumed vector dependence, but the compiler's cost analysis then concludes that vectorizing the loop would not be efficient:
Ipp32f *redChannel = pSrc[0];
Ipp32f *greenChannel = pSrc[1];
Ipp32f *blueChannel = pSrc[2];
for(int y = 0; y < imageHeight; ++y)
{
    #pragma ivdep
    for(int x = 0; x < imageWidth; ++x)
    {
        pDst[x + y * pDstRowStep].red = redChannel[x + y * pSrcRowStep];
        pDst[x + y * pDstRowStep].green = greenChannel[x + y * pSrcRowStep];
        pDst[x + y * pDstRowStep].blue = blueChannel[x + y * pSrcRowStep];
    }
}
The compiler considers vectorization inefficient because the loop copies a contiguous block of memory from each source plane to non-contiguous locations in the destination: consecutive 'red' fields are one whole struct apart, so the stores become a scatter. If you still want to force vectorization and check whether it improves performance over the non-vectorized version, use #pragma simd instead of #pragma ivdep, as shown below:
#include <ipp.h>
struct Dest{
    float red;
    float green;
    float blue;
};
void foo(Dest *pDst, Ipp32f **pSrc, int imageHeight, int imageWidth, int pSrcRowStep, int pDstRowStep){
    Ipp32f *redChannel = pSrc[0];
    Ipp32f *greenChannel = pSrc[1];
    Ipp32f *blueChannel = pSrc[2];
    for(int y = 0; y < imageHeight; ++y)
    {
        #pragma simd
        for(int x = 0; x < imageWidth; ++x)
        {
            pDst[x + y * pDstRowStep].red = redChannel[x + y * pSrcRowStep];
            pDst[x + y * pDstRowStep].green = greenChannel[x + y * pSrcRowStep];
            pDst[x + y * pDstRowStep].blue = blueChannel[x + y * pSrcRowStep];
        }
    }
    return;
}
The corresponding vectorization report is:
$ icpc -c test.cc -vec-report2
test.cc(14): (col. 9) remark: SIMD LOOP WAS VECTORIZED
test.cc(11): (col. 5) remark: loop was not vectorized: not inner loop
More documentation on pragma simd is available at https://software.intel.com/en-us/node/514582.
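
For anyone who wants to experiment with this loop without IPP installed, here is a minimal standalone sketch of the same planar-to-interleaved copy. It is not taken from the posts above: Rgb32f and planarToInterleaved are hypothetical names, plain float stands in for Ipp32f, and the row steps are assumed to be counted in elements, as in the loops in this thread. The per-row base pointers are hoisted so that the inner loop indexes by x only. Whether that changes what the vectorizer reports is something to verify with -vec-report on your own build.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for the destination struct in this thread.
struct Rgb32f {
    float red;
    float green;
    float blue;
};

// Consecutive 'red' fields are one struct (three floats) apart in memory.
// This stride is why the stores are a scatter rather than one contiguous write.
static_assert(sizeof(Rgb32f) == 3 * sizeof(float), "unexpected struct padding");

// Copies three planar float channels into an interleaved RGB buffer.
// Assumption: srcRowStep and dstRowStep are in elements, not bytes.
void planarToInterleaved(Rgb32f* dst, const float* const src[3],
                         int height, int width,
                         int srcRowStep, int dstRowStep)
{
    assert(width <= srcRowStep && width <= dstRowStep);

    for (int y = 0; y < height; ++y) {
        // Hoist the row base pointers out of the inner loop.
        const std::ptrdiff_t srcOffset = static_cast<std::ptrdiff_t>(y) * srcRowStep;
        const float* redRow   = src[0] + srcOffset;
        const float* greenRow = src[1] + srcOffset;
        const float* blueRow  = src[2] + srcOffset;
        Rgb32f* dstRow = dst + static_cast<std::ptrdiff_t>(y) * dstRowStep;

        // Inner loop: unit-stride loads, stride-3 (in floats) stores.
        for (int x = 0; x < width; ++x) {
            dstRow[x].red   = redRow[x];
            dstRow[x].green = greenRow[x];
            dstRow[x].blue  = blueRow[x];
        }
    }
}
```

A vectorization pragma such as the #pragma simd from the reply above can be placed directly above the inner loop. Timing both variants on real image sizes is the only reliable way to tell whether forcing vectorization pays off.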