Tags: image-processing, arm, neon

ARM NEON Optimization for image transformation


I'm applying a transformation to NV12 video that shuffles the pixels of each frame. On an ARM device such as the Google Nexus 7 (2013), performance is pretty bad at 30 fps for a 1024x512 area with the following C code:

* Pre-processing done only once at beginning of video *

//Temporary tables for the destination
for (j = 0; j < height; j++)
    for (i = 0; i < width; i++) {
        toY[i][j] = j * width + i;
        toUV[i][j] = j / 2 * width + ((int)(i / 2)) * 2;
    }

//Temporary tables for the source
for (j = 0; j < height; j++)
    for (i = 0; i < width; i++) {
        fromY[i][j] = funcY(i, j) * width + funcX(i, j);
        fromUV[i][j] = funcY(i, j) / 2 * width + ((int)(funcX(i, j) / 2)) * 2;
    }

* Process done at each frame *

for (j = 0; j < height; j++)
    for (i = 0; i < width; i++) {
        destY[ toY[i][j] ] = srcY[ fromY[i][j] ];
        if ((i % 2 == 0) && (j % 2 == 0)) {
            destUV[ toUV[i][j] ] = srcUV[ fromUV[i][j] ];
            destUV[ toUV[i][j] + 1 ] = srcUV[ fromUV[i][j] + 1 ];
        }
    }

Although they are computed only once, funcX and funcY are fairly complex transformations, so that part is not easy to optimize.

Is there still a way to speed up the double loop that runs on each frame, given the "from" tables?


Solution

  • You create FOUR lookup tables 8 times as large as the original image?

    You put an unnecessary if statement in the innermost loop?

    What about swapping i and j?

    Honestly, your question should be tagged with [c] instead of arm, neon, or image-processing to start with.

    Since you didn't show what funcY and funcX do, the best answer I can give is the following. (Performance skyrocketed, and the fix is something really, really fundamental.)

    //Temporary tables for the source
    pTemp = fromYUV;
    for (j = 0; j < height; j+=2)
    {
        // Even row: two Y offsets, then one UV offset, per 2-pixel step
        for (i = 0; i < width; i+=2) {
            *pTemp++ = funcY(i, j) * width + funcX(i, j);
            *pTemp++ = funcY(i+1, j) * width + funcX(i+1, j);
            *pTemp++ = funcY(i, j) / 2 * width + ((int)(funcX(i, j) / 2)) * 2;
        }
        // Odd row: Y offsets only (chroma is shared with the row above)
        for (i = 0; i < width; i+=2) {
            *pTemp++ = funcY(i, j+1) * width + funcX(i, j+1);
            *pTemp++ = funcY(i+1, j+1) * width + funcX(i+1, j+1);
        }
    }
    
    // Process done at each frame
    pTemp = fromYUV;
    pTempY = destY;
    pTempUV = destUV;
    for (j = 0; j < height; j+=2)
    {
        for (i = 0; i < width; i+=2) {
            *pTempY++ = srcY[*pTemp++];
            *pTempY++ = srcY[*pTemp++];
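            *pTempUV++ = srcUV[*pTemp];        // U byte of the interleaved pair
            *pTempUV++ = srcUV[*pTemp++ + 1];  // V byte, then advance the table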
        }
        for (i = 0; i < width; i+=2) {
            *pTempY++ = srcY[*pTemp++];
            *pTempY++ = srcY[*pTemp++];
        }
    }
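
For reference, here is a minimal self-contained sketch of the single interleaved table approach. Since funcX and funcY were not shown, `map_x` and `map_y` below are hypothetical stand-ins (a plain horizontal flip), and the names `build_table`, `remap_frame`, `src_offset_y` and `src_offset_uv` are made up for illustration. It assumes 8-bit planes, even width and height, and a row stride equal to the width. Each chroma sample is copied as a pair of bytes, since NV12 interleaves U and V.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-ins for funcX/funcY: a horizontal flip. */
static int map_x(int x, int y, int width) { (void)y; return width - 1 - x; }
static int map_y(int x, int y) { (void)x; return y; }

/* Byte offset of the source luma sample for destination pixel (x, y). */
static uint32_t src_offset_y(int x, int y, int width)
{
    return (uint32_t)(map_y(x, y) * width + map_x(x, y, width));
}

/* Byte offset of the source U byte (V follows at +1) for destination (x, y). */
static uint32_t src_offset_uv(int x, int y, int width)
{
    return (uint32_t)(map_y(x, y) / 2 * width + map_x(x, y, width) / 2 * 2);
}

/* Built once: one table, stored in exactly the order the frame loop reads it.
   Per pair of rows: (Y, Y, UV) per 2 pixels, then (Y, Y) per 2 pixels. */
uint32_t *build_table(int width, int height)
{
    assert(width % 2 == 0 && height % 2 == 0);
    uint32_t *table = malloc((size_t)width * height * 5 / 4 * sizeof *table);
    if (!table)
        return NULL;
    uint32_t *p = table;
    for (int y = 0; y < height; y += 2) {
        for (int x = 0; x < width; x += 2) {
            *p++ = src_offset_y(x, y, width);
            *p++ = src_offset_y(x + 1, y, width);
            *p++ = src_offset_uv(x, y, width);
        }
        for (int x = 0; x < width; x += 2) {
            *p++ = src_offset_y(x, y + 1, width);
            *p++ = src_offset_y(x + 1, y + 1, width);
        }
    }
    return table;
}

/* Run per frame: sequential table reads, sequential writes, no branches
   inside the loops other than the loop conditions themselves. */
void remap_frame(const uint32_t *table,
                 const uint8_t *srcY, const uint8_t *srcUV,
                 uint8_t *dstY, uint8_t *dstUV, int width, int height)
{
    const uint32_t *p = table;
    for (int y = 0; y < height; y += 2) {
        for (int x = 0; x < width; x += 2) {
            *dstY++ = srcY[*p++];
            *dstY++ = srcY[*p++];
            uint32_t uv = *p++;
            *dstUV++ = srcUV[uv];      /* U */
            *dstUV++ = srcUV[uv + 1];  /* V */
        }
        for (int x = 0; x < width; x += 2) {
            *dstY++ = srcY[*p++];
            *dstY++ = srcY[*p++];
        }
    }
}
```

Call `build_table` once when the video starts, `remap_frame` for every frame, and `free` the table at the end. The only random accesses left are the source reads, which the transformation itself makes unavoidable.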
    

    You should avoid, when possible:

    • accessing multiple memory areas at once
    • random memory access
    • if statements within loops

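    The worst crime you committed is the order of i and j (which you don't need in the first place).

    If you access a pixel at the coordinates x and y, it's pixel = image[y][x] and NOT image[x][y]. C arrays are row-major, so consecutive x values sit next to each other in memory, while consecutive y values are a whole row apart.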
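
To make the row-major point concrete, here is a small sketch. The helper names `pixel_offset` and `sum_plane` are made up for illustration, and it assumes an 8-bit plane whose stride equals its width. Stepping x moves one byte, stepping y moves a full row, so x belongs in the inner loop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Offset of pixel (x, y) in a row-major plane: the same address
   that image[y][x] would refer to in a 2D C array. */
size_t pixel_offset(int x, int y, int width)
{
    return (size_t)y * width + x;
}

/* y in the outer loop, x in the inner loop: the plane is read
   front to back, one consecutive byte after another. */
uint32_t sum_plane(const uint8_t *plane, int width, int height)
{
    uint32_t sum = 0;
    for (int y = 0; y < height; y++)
        for (int x = 0; x < width; x++)
            sum += plane[pixel_offset(x, y, width)];
    return sum;
}
```

With the loops swapped (x outer, y inner), every access would jump `width` bytes ahead, which is what indexing the tables as [i][j] does in the question.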