parallel-processing opencl processing frame-rate

OpenCL: how to increase the frame rate when processing images

I have started using OpenCL for the first time and I'm trying to run the Sobel edge detection example from this site: http://www.karlosp.net/blog/2012/05/03/opencl-opencv-sobel-edge-detector/. When I run the kernel on the GPU, the frame rate is below 15 fps and GPU utilization is below 5%. How can I keep all GPU threads busy, as I would with OpenMP on the CPU, so that utilization reaches at least 95%?

The code

Kernel code

__kernel void sobel(__global float *A, __global float *R, uint width, uint height) {

int globalIdx = get_global_id(0);
int globalIdy = get_global_id(1);
int index = width * globalIdy + globalIdx;
float a,b,c,d,e,f,g,h,i;
float sobelX = 0;
float sobelY = 0;
// Interior pixels only: on the border the 3x3 window would read outside the image or wrap to a neighbouring row.
if(globalIdx > 0 && globalIdx < width-1 && globalIdy > 0 && globalIdy < height-1){
    a = A[index-1-width] * -1.0f;
    b = A[index-0-width] *  0.0f;
    c = A[index+1-width] * +1.0f;
    d = A[index-1] * -2.0f;
    e = A[index-0] *  0.0f;
    f = A[index+1] * +2.0f;
    g = A[index-1+width] * -1.0f;
    h = A[index-0+width] *  0.0f;
    i = A[index+1+width] * +1.0f;
    sobelX = a+b+c+d+e+f+g+h+i;

    a = A[index-1-width] * -1.0f;
    b = A[index-0-width] * -2.0f;
    c = A[index+1-width] * -1.0f;
    d = A[index-1] * 0.0f;
    e = A[index-0] * 0.0f;
    f = A[index+1] * 0.0f;
    g = A[index-1+width] * +1.0f;
    h = A[index-0+width] * +2.0f;
    i = A[index+1+width] * +1.0f;
    sobelY = a+b+c+d+e+f+g+h+i;
}


R[index] = sqrt(pow(sobelX,2) + pow(sobelY,2));

}

Solution

The code you link to has a few inefficiencies (in no particular order):

The calls to clFinish after every OpenCL call inside the loop are unnecessary. The clEnqueueWriteBuffer and clEnqueueReadBuffer calls both pass CL_TRUE for the blocking argument, i.e. they do not return until the write/read has finished (the definition of a blocking function call). On an in-order command queue (the default unless out-of-order execution is requested), the blocking read also cannot complete before the kernel enqueued ahead of it has finished.
You convert your unsigned char image into a float image before sending it to the GPU. This is not necessary, as the GPU can work with unsigned char directly and cast to float where needed. Converting to float on the CPU and then sending the result to the GPU transfers 4x the amount of data (4 bytes per pixel per channel vs. 1 byte).
You call cvWaitKey(10), pausing for 10 milliseconds to wait for a keypress on every frame, hence this code will never exceed 100 frames per second (minor issue).
The RGB to gray-scale conversion can also be done on the GPU, at the expense of sending 3 unsigned chars per pixel instead of 1, so there is a trade-off there that needs testing.
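
To put rough numbers on the transfer costs mentioned in the list above, here is a minimal C++ sketch. The helper name bytes_per_frame and the 640x480 frame size used below are assumptions for illustration; they are not taken from the linked code.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: number of bytes that must be copied to the GPU for one frame.
//   channels:         1 for gray-scale, 3 for RGB
//   bytes_per_sample: 1 for unsigned char, 4 for float
std::size_t bytes_per_frame(std::size_t width, std::size_t height,
                            std::size_t channels, std::size_t bytes_per_sample) {
    assert(width > 0 && height > 0 && channels > 0 && bytes_per_sample > 0);
    return width * height * channels * bytes_per_sample;
}
```

For an assumed 640x480 frame, bytes_per_frame(640, 480, 1, 1) gives 307200 bytes for gray-scale unsigned char, bytes_per_frame(640, 480, 1, 4) gives 1228800 bytes for gray-scale float, and bytes_per_frame(640, 480, 3, 1) gives 921600 bytes for RGB unsigned char. Under these assumptions, uploading raw RGB and converting on the GPU still moves less data than uploading a float gray-scale image, but more than uploading unsigned char gray-scale, which is why the trade-off needs measuring.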

Your timing approach is also flawed. Your code measures the time taken to load, process and display a single frame. Processing consists of both OpenCL and OpenCV components. You should benchmark each of these separately to determine how long each one takes so that you can determine exactly where your bottleneck lies.
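
One possible way to benchmark the stages separately is sketched below in C++. The helper names time_stage_ms and average_ms are hypothetical, and the stage names in the comments (capture, upload, kernel, download, display) are an assumed breakdown of the loop, not the structure of the linked code.

```cpp
#include <cassert>
#include <chrono>

// Runs one stage (any callable) and returns its wall-clock duration in milliseconds.
template <typename Stage>
double time_stage_ms(Stage&& stage) {
    const auto start = std::chrono::steady_clock::now();
    stage();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}

// Averages an accumulated time over several frames to smooth out per-frame noise.
double average_ms(double total_ms, int frames) {
    assert(frames > 0);
    return total_ms / frames;
}

// Assumed usage inside the frame loop (the wrapped functions are hypothetical):
//   capture_total  += time_stage_ms([&] { grab_frame(); });
//   upload_total   += time_stage_ms([&] { upload_frame(); });
//   kernel_total   += time_stage_ms([&] { run_kernel_and_wait(); });
//   download_total += time_stage_ms([&] { download_result(); });
//   display_total  += time_stage_ms([&] { show_frame(); });
```

Because clEnqueueNDRangeKernel only enqueues work and returns immediately, the timed region for the kernel needs to end with a call that waits for completion; otherwise only the enqueue cost is measured. Comparing the per-stage averages should show whether the time goes to capture, transfer, the kernel itself, or display.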

It also just occurred to me to ask: what is the frame rate of the camera that OpenCV is capturing from? The loop cannot run faster than the camera delivers frames.
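
As a rough sketch of why the camera rate and the cvWaitKey delay matter, the following C++ function computes an upper bound on the loop's frame rate. The function name max_loop_fps and the example figures are assumptions. The model assumes that capture, processing and the key wait run one after another in a single thread.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical model: upper bound on frames per second for a sequential loop.
//   camera_fps:    rate at which the camera delivers new frames
//   processing_ms: time spent per frame on transfer, kernel and display
//   wait_ms:       delay passed to the key-wait call
double max_loop_fps(double camera_fps, double processing_ms, double wait_ms) {
    assert(camera_fps > 0.0 && processing_ms >= 0.0 && wait_ms >= 0.0);
    const double per_frame_ms = processing_ms + wait_ms;
    if (per_frame_ms == 0.0) {
        return camera_fps;  // nothing else limits the loop
    }
    // The loop can be no faster than the camera and no faster than its own iteration time.
    return std::min(camera_fps, 1000.0 / per_frame_ms);
}
```

With a 10 ms wait and no processing cost, max_loop_fps(1000.0, 0.0, 10.0) is capped at 100 frames per second. With a camera that delivers 15 frames per second (an assumed figure), max_loop_fps(15.0, 1.0, 10.0) stays at 15 however fast the kernel runs. In that situation low GPU utilization would be expected rather than a sign of a slow kernel.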