c algorithm image-processing optimization image-optimization

A faster (optimized) solution to image decimation (C++)

I am looking for a faster way to do what the following C code does. I have a 640x480 image and I want to decimate it by a factor of 2 by dropping every other row and every other column. My current code is below. Is there a better way to optimize it?

#define INPUT_NUM_ROW 480
#define INPUT_NUM_COL 640
#define OUTPUT_NUM_ROW 240
#define OUTPUT_NUM_COL 320

unsigned char inputBuf[INPUT_NUM_ROW* INPUT_NUM_COL];
unsigned char outputBuf[OUTPUT_NUM_ROW* OUTPUT_NUM_COL];

void imageDecimate(unsigned char *outputImage, unsigned char *inputImage)
{
    /* Fill in your code here */
    for (int p = 0; p < OUTPUT_NUM_ROW; p++) {
        for (int q = 0; q < OUTPUT_NUM_COL; q++) {
            outputImage[p*OUTPUT_NUM_COL + q] = inputImage[(p*INPUT_NUM_COL+q)*2];
            // cout << "The pixel at " << p*OUTPUT_NUM_COL+q << " is " << outputImage[p*OUTPUT_NUM_COL+q] << endl;
        }
    }
}

Solution

Rather than recomputing both indices from p and q on every pass through the inner loop, you could compute the row offsets once per row and then just increment:

void imageDecimate(unsigned char *outputImage, unsigned char *inputImage)
{
    int outputIndex;
    int inputIndex;
    for (int p = 0; p < OUTPUT_NUM_ROW; p++) {
        /* start of input row 2*p and of output row p */
        inputIndex = p * INPUT_NUM_COL * 2;
        outputIndex = p * OUTPUT_NUM_COL;
        for (int q = 0; q < OUTPUT_NUM_COL; q++) {
            outputImage[outputIndex] = inputImage[inputIndex];
            inputIndex += 2;   /* skip every other column */
            outputIndex++;
            // cout << "The pixel at " << p*OUTPUT_NUM_COL+q << " is " << outputImage[p*OUTPUT_NUM_COL+q] << endl;
        }
    }
}

You could also fold the increments into the copy statement, or initialize inputIndex and outputIndex only once before the outer loop (in which case inputIndex has to advance by an extra INPUT_NUM_COL at the end of each row to skip the discarded row), but neither would give you as much of a performance boost as moving the multiplications out of the inner loop. I assume that bulk copy functions don't offer this kind of strided access, but if one does, and it uses hardware acceleration that is available on all of your target platforms, then it would be a better choice.
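Here is a minimal sketch of those two variations combined, assuming the same 640x480 input and the macros from the question. The function name imageDecimateRunningIndex is made up for illustration. Both indices are set once, the output increment is folded into the copy, and the input index jumps over the discarded row at the end of each output row.

```c
#define INPUT_NUM_ROW 480
#define INPUT_NUM_COL 640
#define OUTPUT_NUM_ROW 240
#define OUTPUT_NUM_COL 320

/* Hypothetical name; same contract as imageDecimate in the question. */
void imageDecimateRunningIndex(unsigned char *outputImage,
                               const unsigned char *inputImage)
{
    int inputIndex = 0;   /* set once, never recomputed from p */
    int outputIndex = 0;

    for (int p = 0; p < OUTPUT_NUM_ROW; p++) {
        for (int q = 0; q < OUTPUT_NUM_COL; q++) {
            /* copy one pixel and advance the output index inline */
            outputImage[outputIndex++] = inputImage[inputIndex];
            inputIndex += 2;   /* skip the neighbouring column */
        }
        /* The inner loop walked one full input row (320 * 2 = 640),
           so inputIndex now sits at the start of the row to discard.
           Step over that row as well. */
        inputIndex += INPUT_NUM_COL;
    }
}
```

I would not expect this to be dramatically faster than the per-row version, since it only removes two multiplications per row, so measure before committing to it.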

I am also assuming that the compiler turns array accesses like these into the most efficient pointer arithmetic available.
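For comparison, this is roughly what the explicit pointer form would look like. It is only a sketch: the name imageDecimatePtr is hypothetical, and the C99 restrict qualifier is added on the assumption that the input and output buffers never overlap (drop it if you build this as C++, where restrict is not a standard keyword). Whether it beats the indexed version depends on your compiler and flags.

```c
#define INPUT_NUM_ROW 480
#define INPUT_NUM_COL 640
#define OUTPUT_NUM_ROW 240
#define OUTPUT_NUM_COL 320

/* Hypothetical name; assumes the two buffers do not overlap. */
void imageDecimatePtr(unsigned char *restrict outputImage,
                      const unsigned char *restrict inputImage)
{
    unsigned char *dst = outputImage;   /* output is written contiguously */

    for (int p = 0; p < OUTPUT_NUM_ROW; p++) {
        /* start of input row 2*p; the rows in between are never read */
        const unsigned char *src = inputImage + p * 2 * INPUT_NUM_COL;

        for (int q = 0; q < OUTPUT_NUM_COL; q++) {
            *dst++ = *src;   /* copy one pixel */
            src += 2;        /* skip the neighbouring column */
        }
    }
}
```

If both versions produce the same machine code with optimization turned on, the assumption above holds for your toolchain and the indexed form is probably the more readable one to keep.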