I'm working on ARM and trying to optimize downsampling an image. I used OpenCV's cv::resize, but it is slow: about 3 ms to go from 1280*960 to 400*300. I'm trying to accelerate it with OpenMP; however, once I add the parallel for directive, the output image is distorted. I know this is related to private variables and data shared between the threads, but I can't find the problem.
void resizeBilinearGray(uint8_t *pixels, uint8_t *temp, int w, int h, int w2, int h2) {
    int A, B, C, D, x, y, index, gray ;
    float x_ratio = ((float)(w-1))/w2 ;
    float y_ratio = ((float)(h-1))/h2 ;
    float x_diff, y_diff;
    int offset = 0 ;
    #pragma omp parallel for
    for (int i=0;i<h2;i++) {
        for (int j=0;j<w2;j++) {
            x = (int)(x_ratio * j) ;
            y = (int)(y_ratio * i) ;
            x_diff = (x_ratio * j) - x ;
            y_diff = (y_ratio * i) - y ;
            index = y*w+x ;
            // range is 0 to 255 thus bitwise AND with 0xff
            A = pixels[index] & 0xff ;
            B = pixels[index+1] & 0xff ;
            C = pixels[index+w] & 0xff ;
            D = pixels[index+w+1] & 0xff ;
            // Y = A(1-w)(1-h) + B(w)(1-h) + C(h)(1-w) + Dwh
            gray = (int)(
                A*(1-x_diff)*(1-y_diff) + B*(x_diff)*(1-y_diff) +
                C*(y_diff)*(1-x_diff) + D*(x_diff*y_diff)
            ) ;
            temp[offset++] = gray ;
        }
    }
}
Why don't you try replacing temp[offset++] with temp[i*w2 + j]?
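Your offset has multiple problems. For one, it has a race condition: it is shared, and every thread increments it at the same time. But worse, OpenMP assigns very different ranges of i to each thread, so a running counter no longer matches the pixel (i, j) a thread is actually computing, and results get written to the wrong places in temp. That's why your image is distorted.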
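As a minimal sketch (not tested on ARM), the function from the question could look like the following with the destination index computed from i and j. It also moves the per-pixel temporaries (x, y, index, A..D, gray, x_diff, y_diff) inside the loops: variables declared before the parallel region are shared between threads by default in OpenMP, so leaving them outside would be a second source of races. The assert on the arguments is my own addition, not part of the original code.

```c
#include <assert.h>
#include <stdint.h>

/* Bilinear downscale of an 8-bit grayscale image (w x h -> w2 x h2).
 * Every per-pixel temporary is declared inside the loops, so each thread
 * has its own copy, and the output position depends only on (i, j). */
void resizeBilinearGray(const uint8_t *pixels, uint8_t *temp,
                        int w, int h, int w2, int h2)
{
    /* Added sanity check: the 2x2 neighbourhood needs at least 2x2 pixels. */
    assert(w >= 2 && h >= 2 && w2 > 0 && h2 > 0);

    const float x_ratio = (float)(w - 1) / w2;
    const float y_ratio = (float)(h - 1) / h2;

    #pragma omp parallel for
    for (int i = 0; i < h2; i++) {
        const int   y      = (int)(y_ratio * i);
        const float y_diff = y_ratio * i - y;

        for (int j = 0; j < w2; j++) {
            const int   x      = (int)(x_ratio * j);
            const float x_diff = x_ratio * j - x;
            const int   index  = y * w + x;

            /* uint8_t is already in 0..255, so no masking is needed. */
            const int A = pixels[index];
            const int B = pixels[index + 1];
            const int C = pixels[index + w];
            const int D = pixels[index + w + 1];

            /* Y = A(1-dx)(1-dy) + B(dx)(1-dy) + C(dy)(1-dx) + D(dx)(dy) */
            const int gray = (int)(
                A * (1 - x_diff) * (1 - y_diff) + B * x_diff * (1 - y_diff) +
                C * y_diff * (1 - x_diff)       + D * x_diff * y_diff);

            /* No shared counter: the index is derived from i and j. */
            temp[i * w2 + j] = (uint8_t)gray;
        }
    }
}
```

Because each thread now writes only to the rows it owns, the result does not depend on how OpenMP distributes the iterations. Without OpenMP enabled at compile time the pragma is ignored and the function runs serially.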
Besides OpenMP, there are several other ways you could try to speed up your code. I don't know ARM, but on Intel you can get a big speedup with SSE. Additionally, you could try fixed-point arithmetic. I have found speedups with both in bilinear interpolation. fastcpp.blogspot.no/2011/06/bilinear-pixel-interpolation-using-sse.html
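To illustrate the fixed-point idea, here is a rough sketch of the same interpolation using only integer arithmetic. It is not taken from the linked post: the function name resizeBilinearGrayFixed, the 16.16 coordinate format and the 8-bit weights are my own assumptions, and it assumes the source image is smaller than 65536 pixels in each dimension. Because the weights are truncated to 8 bits, the output can differ slightly from the float version. Whether it is actually faster on a given ARM core would need measuring.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical fixed-point variant: source coordinates are stepped in
 * 16.16 fixed point and the interpolation weights use 8 fractional bits. */
void resizeBilinearGrayFixed(const uint8_t *pixels, uint8_t *temp,
                             int w, int h, int w2, int h2)
{
    /* Assumed limits so that the 16.16 coordinates fit in 32 bits. */
    assert(w >= 2 && h >= 2 && w2 > 0 && h2 > 0);
    assert(w <= 65535 && h <= 65535);

    /* Distance between source samples, in 16.16 fixed point. */
    const uint32_t x_step = ((uint32_t)(w - 1) << 16) / (uint32_t)w2;
    const uint32_t y_step = ((uint32_t)(h - 1) << 16) / (uint32_t)h2;

    #pragma omp parallel for
    for (int i = 0; i < h2; i++) {
        const uint32_t sy = y_step * (uint32_t)i;
        const int      y  = (int)(sy >> 16);      /* integer row        */
        const uint32_t fy = (sy >> 8) & 0xff;     /* row weight, 0..255 */
        const uint8_t *row0 = pixels + y * w;
        const uint8_t *row1 = row0 + w;

        for (int j = 0; j < w2; j++) {
            const uint32_t sx = x_step * (uint32_t)j;
            const uint32_t x  = sx >> 16;          /* integer column        */
            const uint32_t fx = (sx >> 8) & 0xff;  /* column weight, 0..255 */

            /* Horizontal blend of the two rows, scaled by 256. */
            const uint32_t top    = row0[x] * (256 - fx) + row0[x + 1] * fx;
            const uint32_t bottom = row1[x] * (256 - fx) + row1[x + 1] * fx;

            /* Vertical blend; the total scale is 256 * 256 = 2^16. */
            temp[i * w2 + j] =
                (uint8_t)((top * (256 - fy) + bottom * fy) >> 16);
        }
    }
}
```

The design choice here is to keep everything in 32-bit unsigned integers: the largest intermediate value is 255 * 256 * 256, which is well within range, so no 64-bit arithmetic is required.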