RGB32 byte array memory access optimization

I do not know much about low-level memory optimization, but I am trying to figure out how to optimize memory access to an unsigned char array holding RGB32 data.

I am building a DirectShow transform filter (CTransInPlaceFilter) that needs to average areas of the data. The data given to me is in RGB32 format (B G R 0xFF B G R 0xFF B G R 0xFF etc.). I loop over the data and calculate the average of the red, green and blue channels.

I find that when I dereference the data in the byte array, CPU usage climbs while the video plays; without those accesses, the CPU does far less work.

After reading a bunch of posts, I think this is a memory to CPU bandwidth bottleneck.

Here is the code for the transform function:

HRESULT CFilter::Transform(IMediaSample *pSample) {

BYTE* data = NULL;
pSample->GetPointer(&data);

if (mVideoType == MEDIASUBTYPE_RGB32) {
    Rect roi(0, 0, 400, 400);           // Normally this is dynamic
    int totalPixels = roi.width * roi.height;

    // Find the average color
    unsigned int totalR = 0, totalG = 0, totalB = 0;
    for (int r = 0; r < roi.height; r++) {
        int y = roi.y + r;
        BYTE* pixel = data + (roi.x + y * mWidth) * 4;      // 4 bytes per pixel

        for (int c = 0; c < roi.width; c++) {
            totalB += *pixel;           // THESE 3 LINES ARE THE ISSUE
            totalG += *(++pixel);
            totalR += *(++pixel);
            pixel++;
        }
    }
    // Integer division already truncates, so floor() is not needed
    int meanR = (int)(totalR / totalPixels);
    int meanG = (int)(totalG / totalPixels);
    int meanB = (int)(totalB / totalPixels);
    // Does work with the averaged data
}
return S_OK;
}

When I run the video without the 3 dereferencing lines, I get about 10-14% CPU usage. With those lines, I get 30-34% CPU usage.

I also tried to copy the data to a buffer to access the data.

memcpy(mData, data, mWidth * mHeight * 4);      // mData is only allocated once in constructor
...
totalB += mData[(x + y * mWidth) * 4];          // 4 bytes per pixel

The CPU usage became 22-25%.
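A variation on the copy approach above is to copy only the rows of the region of interest instead of the whole frame, then average from that copy. The following is a minimal sketch under stated assumptions, not code from the filter: `meanOfRoi`, `MeanColor` and `scratch` are hypothetical names, and rows are assumed to be tightly packed (stride = width * 4), as in the loop above.

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

struct MeanColor { unsigned r, g, b; };

// Hypothetical helper: copies only the ROI rows of a B G R X frame into
// `scratch` (ordinary system memory) and averages the channels from the copy.
// Assumes rows are tightly packed: stride = frameWidth * 4 bytes.
MeanColor meanOfRoi(const std::uint8_t* frame, int frameWidth,
                    int roiX, int roiY, int roiWidth, int roiHeight,
                    std::vector<std::uint8_t>& scratch) {
    const std::size_t rowBytes = static_cast<std::size_t>(roiWidth) * 4;
    scratch.resize(rowBytes * static_cast<std::size_t>(roiHeight));

    // One sequential memcpy per ROI row.
    for (int r = 0; r < roiHeight; ++r) {
        const std::size_t firstPixel =
            static_cast<std::size_t>(roiY + r) * frameWidth + roiX;
        std::memcpy(scratch.data() + r * rowBytes, frame + firstPixel * 4, rowBytes);
    }

    // Accumulate from the local copy; 64-bit totals cannot overflow here.
    std::uint64_t totalB = 0, totalG = 0, totalR = 0;
    for (std::size_t i = 0; i < scratch.size(); i += 4) {
        totalB += scratch[i];
        totalG += scratch[i + 1];
        totalR += scratch[i + 2];
    }

    const std::uint64_t n = static_cast<std::uint64_t>(roiWidth) * roiHeight;
    if (n == 0) return MeanColor{0, 0, 0};
    return MeanColor{static_cast<unsigned>(totalR / n),
                     static_cast<unsigned>(totalG / n),
                     static_cast<unsigned>(totalB / n)};
}
```

Reusing the same `scratch` vector across frames avoids reallocating it on every call. Whether this helps depends on where the source buffer lives, so it is worth measuring rather than assuming.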

Is it possible to bring the CPU usage back down to around 10%? Can I somehow access the data much more quickly? Should I try using asm?

Other info: The video is 10-bit 1280 x 720, and I am using GraphEdit to test my filter. My filter does not change the source image (so it does not copy). I may thread this process if that helps.

Thanks in advance!

Edit:

For more info, I added the DirectShow graph. The video is 10-bit, but the LAV filters pass RGB32 (8-bit) to me. This is a debug build; would a release build speed it up? (Eventually I will compile a release build.)

[Image: DirectShow filter graph]

I ran both versions (with and without the dereferencing lines) and benchmarked their elapsed time. With dereferencing, each run of the transform takes around 0.126208 milliseconds. Without the dereferencing, it takes 0.009 milliseconds.
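For reference, a measurement like this can be taken with a small timing wrapper. This is only a sketch of one way to do it, not the code used for the numbers above; `elapsedMilliseconds` is a hypothetical helper built on `std::chrono::steady_clock`.

```cpp
#include <chrono>

// Hypothetical helper: runs `work` once and returns the elapsed
// wall-clock time in milliseconds.
template <typename Work>
double elapsedMilliseconds(Work&& work) {
    const auto start = std::chrono::steady_clock::now();
    work();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

It would be called as `double ms = elapsedMilliseconds([&] { /* averaging loop */ });`. Averaging the result over many frames gives a steadier figure than a single run.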

I also tried to reduce the loops by doing

for (int c = 0; c < roi.width; c += 4) {        // 4 pixels per iteration; assumes roi.width is a multiple of 4
    BYTE* p = pixel + c * 4;                    // 4 bytes per pixel
    totalB += p[0] + p[4] + p[8] + p[12];
    totalG += p[1] + p[5] + p[9] + p[13];
    totalR += p[2] + p[6] + p[10] + p[14];
}

This did not change the CPU usage, and the elapsed time is still around 0.12 milliseconds.

Edit 2:

I also built all dependencies and the project itself in release, and I get the same result. Access is still very slow.

Solution

I solved the issue. The problem was that data pointed to memory located in video memory. Every read in my loop pulled the data across from video memory to system RAM, which created a memory bandwidth bottleneck. Copying all the data at once (memcpy) was faster but still very slow. Instead, I had to use specific Intel SSE4 instructions to copy the data from video memory to system memory efficiently.

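I used this file (gpu_memcpy). It contains a function similar to memcpy, but it is written with SSE4 instructions suited to reading from video memory, so the copy is still done by the CPU rather than by the GPU. It is much faster, and after copying, accessing the data is as fast as usual.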
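To illustrate the general technique, here is a minimal sketch of a copy routine built on the SSE4.1 streaming load intrinsic `_mm_stream_load_si128`. It is not the contents of the gpu_memcpy file, which may differ in block size and other details. `streamCopy` and the two macros are hypothetical names, both pointers are assumed to be 16-byte aligned, and the CPU is assumed to support SSE4.1. On non-x86 targets the sketch simply falls back to memcpy.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

#if defined(__x86_64__) || defined(_M_X64) || defined(__i386__) || defined(_M_IX86)
#include <smmintrin.h>  // SSE4.1 intrinsics, including _mm_stream_load_si128
#define STREAM_COPY_X86 1
#endif

// GCC and Clang need SSE4.1 enabled for the function that uses the intrinsic.
#if defined(STREAM_COPY_X86) && (defined(__GNUC__) || defined(__clang__))
#define STREAM_COPY_TARGET __attribute__((target("sse4.1")))
#else
#define STREAM_COPY_TARGET
#endif

// Hypothetical sketch: copies `bytes` bytes from `src` to `dst` using
// streaming (non-temporal) 16-byte loads. Assumes both pointers are
// 16-byte aligned. Any tail shorter than 16 bytes is copied with memcpy.
STREAM_COPY_TARGET
void streamCopy(void* dst, const void* src, std::size_t bytes) {
#ifdef STREAM_COPY_X86
    __m128i* d = static_cast<__m128i*>(dst);
    __m128i* s = static_cast<__m128i*>(const_cast<void*>(src));
    const std::size_t blocks = bytes / 16;
    for (std::size_t i = 0; i < blocks; ++i) {
        const __m128i v = _mm_stream_load_si128(s + i);  // streaming load
        _mm_store_si128(d + i, v);                       // ordinary aligned store
    }
    const std::size_t done = blocks * 16;
    std::memcpy(static_cast<std::uint8_t*>(dst) + done,
                static_cast<const std::uint8_t*>(src) + done,
                bytes - done);
#else
    std::memcpy(dst, src, bytes);
#endif
}
```

On ordinary cached memory the streaming load behaves like a normal load, so any benefit only shows up when the source is the kind of video memory described above.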