java, performance, opencl, gpu, lwjgl

OpenCL kernel slower than normal Java loop


I've been looking into OpenCL as a way to optimize code and run tasks in parallel for greater speed than pure Java. Now I'm having a bit of an issue.

I've put together a Java program using LWJGL which, as far as I can tell, should be able to do a nearly identical task -- in this case adding elements from two arrays together and storing the result in a third array -- two separate ways: once in pure Java, and once with an OpenCL kernel. I'm using System.currentTimeMillis() to track how long each one takes for arrays with a large number of elements (~10,000,000). For whatever reason, the pure Java loop seems to execute around 3 to 10 times faster than the CL program, depending on array size. My code is as follows (imports omitted):

public class TestCL {

    private static final int SIZE = 9999999; //Size of the arrays to test; changed between tests

    private static CLContext context; //CL Context
    private static CLPlatform platform; //CL platform
    private static List<CLDevice> devices; //List of CL devices
    private static CLCommandQueue queue; //Command Queue for context
    private static float[] aData, bData, rData; //float arrays to store test data

    //---Kernel Code---
    //The actual kernel script is here:
    //-----------------
    private static String kernel = "kernel void sum(global const float* a, global const float* b, global float* result, int const size){\n" + 
            "const int itemId = get_global_id(0);\n" + 
            "if(itemId < size){\n" + 
            "result[itemId] = a[itemId] + b[itemId];\n" +
            "}\n" +
            "}";;

    public static void main(String[] args){

        aData = new float[SIZE];
        bData = new float[SIZE];
        rData = new float[SIZE]; //Only used for CPU testing

        //arbitrary testing data
        for(int i=0; i<SIZE; i++){
            aData[i] = i;
            bData[i] = SIZE - i;
        }

        try {
            testCPU(); //How long does it take running in traditional Java code on the CPU?
            testGPU(); //How long does the GPU take to run it w/ CL?
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    /**
     * Test the CPU with pure Java code
     */
    private static void testCPU(){
        long time = System.currentTimeMillis();
        for(int i=0; i<SIZE; i++){
            rData[i] = aData[i] + bData[i];
        }
        //Print the time FROM THE START OF THE testCPU() FUNCTION UNTIL NOW
        System.out.println("CPU processing time for " + SIZE + " elements: " + (System.currentTimeMillis() - time));
    }

    /**
     * Test the GPU with OpenCL
     * @throws LWJGLException
     */
    private static void testGPU() throws LWJGLException {
        CLInit(); //Initialize CL and CL Objects

        //Create the CL Program
        CLProgram program = CL10.clCreateProgramWithSource(context, kernel, null);

        int error = CL10.clBuildProgram(program, devices.get(0), "", null);
        Util.checkCLError(error);

        //Create the Kernel
        CLKernel sum = CL10.clCreateKernel(program, "sum", null);

        //Error checker
        IntBuffer eBuf = BufferUtils.createIntBuffer(1);

        //FloatBuffer for the first array of floats
        FloatBuffer aBuf = BufferUtils.createFloatBuffer(SIZE);
        aBuf.put(aData);
        aBuf.rewind();
        CLMem aMem = CL10.clCreateBuffer(context, CL10.CL_MEM_READ_ONLY | CL10.CL_MEM_COPY_HOST_PTR, aBuf, eBuf); //read-only from the kernel's perspective
        Util.checkCLError(eBuf.get(0));

        //And the second
        FloatBuffer bBuf = BufferUtils.createFloatBuffer(SIZE);
        bBuf.put(bData);
        bBuf.rewind();
        CLMem bMem = CL10.clCreateBuffer(context, CL10.CL_MEM_READ_ONLY | CL10.CL_MEM_COPY_HOST_PTR, bBuf, eBuf);
        Util.checkCLError(eBuf.get(0));

        //Memory object to store the result (write-only from the kernel's perspective; SIZE * 4 bytes, 4 per float)
        CLMem rMem = CL10.clCreateBuffer(context, CL10.CL_MEM_WRITE_ONLY, SIZE * 4, eBuf);
        Util.checkCLError(eBuf.get(0));

        //Get time before setting kernel arguments
        long time = System.currentTimeMillis();

        sum.setArg(0, aMem);
        sum.setArg(1, bMem);
        sum.setArg(2, rMem);
        sum.setArg(3, SIZE);

        final int dim = 1;
        PointerBuffer workSize = BufferUtils.createPointerBuffer(dim);
        workSize.put(0, SIZE);

        //Actually running the program
        CL10.clEnqueueNDRangeKernel(queue, sum, dim, null, workSize, null, null, null);
        CL10.clFinish(queue);

        //Read the results back into a FloatBuffer (blocking read)
        FloatBuffer res = BufferUtils.createFloatBuffer(SIZE);
        CL10.clEnqueueReadBuffer(queue, rMem, CL10.CL_TRUE, 0, res, null, null);

        //How long did it take?
        //Print the time FROM THE SETTING OF KERNEL ARGUMENTS UNTIL NOW
        System.out.println("GPU processing time for " + SIZE + " elements: " + (System.currentTimeMillis() - time));

        //Cleanup objects
        CL10.clReleaseKernel(sum);
        CL10.clReleaseProgram(program);
        CL10.clReleaseMemObject(aMem);
        CL10.clReleaseMemObject(bMem);
        CL10.clReleaseMemObject(rMem);

        CLCleanup();
    }

    /**
     * Initialize CL objects
     * @throws LWJGLException
     */
    private static void CLInit() throws LWJGLException {
        IntBuffer eBuf = BufferUtils.createIntBuffer(1);

        CL.create();

        platform = CLPlatform.getPlatforms().get(0);
        devices = platform.getDevices(CL10.CL_DEVICE_TYPE_GPU);
        context = CLContext.create(platform, devices, eBuf);
        queue = CL10.clCreateCommandQueue(context, devices.get(0), CL10.CL_QUEUE_PROFILING_ENABLE, eBuf);

        Util.checkCLError(eBuf.get(0));
    }

    /**
     * Cleanup after CL completion
     */
    private static void CLCleanup(){
        CL10.clReleaseCommandQueue(queue);
        CL10.clReleaseContext(context);
        CL.destroy();
    }

}

Here are a few example console results from various tests:

CPU processing time for 10000000 elements: 24
GPU processing time for 10000000 elements: 88

CPU processing time for 1000000 elements: 7
GPU processing time for 1000000 elements: 10

CPU processing time for 100000000 elements: 193
GPU processing time for 100000000 elements: 943

Is there something wrong with my code that's causing the CL version to run slower, or is this actually to be expected in cases like this? If it's the latter, then when is CL preferable?


Solution

  • I revised the test to do something that I believe is more computationally expensive than simple addition.

    Regarding the CPU test, the line:

    rData[i] = aData[i] + bData[i];
    

    was changed to:

    rData[i] = (float)(Math.sin(aData[i]) * Math.cos(bData[i]));
    

    And in the CL kernel, the line:

    result[itemId] = a[itemId] + b[itemId];
    

    was changed to:

    result[itemId] = sin(a[itemId]) * cos(b[itemId]);
    

    I'm now getting console results such as:

    CPU processing time for 1000000 elements: 154
    GPU processing time for 1000000 elements: 11
    
    CPU processing time for 10000000 elements: 8699
    GPU processing time for 10000000 elements: 98
    

    (For tests of 100000000 elements, the CPU takes longer than I'd like to bother with.)

    For checking accuracy, I added checks that compare an arbitrary element of rData and res to make sure they match. I've omitted the result here; suffice it to say that they were equal.
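
    A minimal sketch of what such a check might look like (the original answer doesn't include the exact code, so the index and tolerance here are illustrative):

    //Compare one arbitrary element of the CPU result array (rData) against
    //the FloatBuffer read back from the GPU (res), allowing a small tolerance;
    //Java's Math.sin and the device's sin aren't guaranteed to agree to the last bit.
    int idx = 12345; //arbitrary index, for illustration only
    float cpuVal = rData[idx];
    float gpuVal = res.get(idx);
    if (Math.abs(cpuVal - gpuVal) > 1e-6f) {
        System.out.println("Mismatch at " + idx + ": " + cpuVal + " vs " + gpuVal);
    }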

    Now that the function is more complicated (two trigonometric functions multiplied together), it appears that the CL kernel is much more efficient than the pure Java loop.
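
    One caveat on the timing itself: the GPU timer above starts only after the buffers have been created, and creating them with CL_MEM_COPY_HOST_PTR is what performs the host-to-device copy. If transfer cost matters for your workload, a fuller measurement would bracket the whole round trip. A minimal sketch reusing the objects from the question (illustrative, not part of the original test):

    long t0 = System.currentTimeMillis();

    //Buffer creation with CL_MEM_COPY_HOST_PTR uploads the data, so it
    //belongs inside the timed region for an end-to-end measurement.
    CLMem aMem = CL10.clCreateBuffer(context, CL10.CL_MEM_READ_ONLY | CL10.CL_MEM_COPY_HOST_PTR, aBuf, eBuf);
    CLMem bMem = CL10.clCreateBuffer(context, CL10.CL_MEM_READ_ONLY | CL10.CL_MEM_COPY_HOST_PTR, bBuf, eBuf);
    CLMem rMem = CL10.clCreateBuffer(context, CL10.CL_MEM_WRITE_ONLY, SIZE * 4, eBuf);

    sum.setArg(0, aMem);
    sum.setArg(1, bMem);
    sum.setArg(2, rMem);
    sum.setArg(3, SIZE);

    CL10.clEnqueueNDRangeKernel(queue, sum, 1, null, workSize, null, null, null);

    //The blocking read (CL_TRUE) doubles as the final synchronization point
    //and includes the device-to-host copy in the measurement.
    CL10.clEnqueueReadBuffer(queue, rMem, CL10.CL_TRUE, 0, res, null, null);

    System.out.println("GPU round trip for " + SIZE + " elements: " + (System.currentTimeMillis() - t0));

    Even with transfers included, the GPU should still come out well ahead once the per-element work is heavy enough; for trivial work like a single addition, the transfers dominate, which is consistent with the original results.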