I'm learning to use OpenCL in Python and I wanted to optimize one of my functions. I read that this can be done by caching global memory in local memory. However, it doesn't work as expected: the kernel now takes twice as long. Am I doing this correctly? Can I optimize this code further?
__kernel void sumOP( __global float *input,
                     __global float *weights,
                     int layer_size,
                     __global float *partialSums, __local float *cache)
{
    private const int i = get_global_id(0);
    private const int in_layer_s = layer_size;
    private const int item_id = get_local_id(0);
    private const int group_id = get_group_id(0);
    private const int group_count = get_num_groups(0);
    const int localsize = get_local_size(0);

    for (int x = 0; x < in_layer_s; x++)
    {
        cache[x] = weights[i * in_layer_s + x];
    }

    float total1 = 0;
    for (int x = 0; x < in_layer_s; x++)
    {
        total1 += cache[x] * input[x];
    }

    partialSums[i] = sigmoid(total1);
}
Python call:

l = opencl.LocalMemory(len(inputs))
event = program.sumOP(queue, output.shape, np.random.randn(6,).shape,
                      inputs.data, weights.data, np.int32(len(inputs)),
                      output.data, l)
Thanks for any advice.
Besides the data race caused by all work items of a group writing to the same local memory addresses cache[x] (as Dithermaster said) and the missing barrier() call, some optimizations can be added once those are fixed:
First loop in kernel
for (int x = 0; x < in_layer_s; x++)
{
    cache[x] = weights[i * in_layer_s + x];
}
scans a different memory area for each work item, one element at a time. This is probably bad for global memory performance, because each work item, running its own copy of the loop, can hit the same memory channel or even the same memory bank, so all work items end up accessing that channel or bank serially. This gets worse as in_layer_s grows, and especially when it is a power of 2. To solve this, neighboring work items should access contiguous addresses: a GPU performs best when global memory is accessed uniformly (coalesced) across work items. Local memory is much more tolerant of random or strided access between work items. That's why the usual advice is uniform loads/stores on global memory and random/scatter/gather on local memory.
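As a sketch of that coalesced pattern (using the item_id and localsize variables already defined in the kernel, and assuming cache can hold in_layer_s floats), the whole group can cooperatively load the shared input array instead, with neighboring work items touching contiguous global addresses, followed by the barrier mentioned above:

```c
// Cooperative, coalesced load: work item item_id loads elements
// item_id, item_id + localsize, item_id + 2*localsize, ...
// so adjacent work items read adjacent global addresses.
for (int x = item_id; x < in_layer_s; x += localsize)
    cache[x] = input[x];
barrier(CLK_LOCAL_MEM_FENCE);  // make the shared data visible to the whole group
```

This also removes the write race, since each cache[x] is written by exactly one work item.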
Second loop in kernel
for (int x = 0; x < in_layer_s; x++)
{
    total1 += cache[x] * input[x];
}
is using only a single accumulator. This creates a dependency chain in which each loop iteration must complete before the next can start. Use at least 2 temporary "total" variables and unroll the loop. Also, if in_layer_s is small enough, the input array could be moved into local or constant memory for faster access (it is read repeatedly, and by all work items, since every work item reads the same input array); you could even put half of input in constant memory and the other half in local memory to increase the total bandwidth.
Is weights[i*in_layer_s + x]
an array of structs? If so, you can get a speedup by converting it to a struct of arrays, which lets you drop the first loop's optimization altogether. This costs some code bloat on the host side, but if speed is the priority, a struct of arrays is both faster and more readable on the GPU side. It also makes it possible to upload only the necessary weights data (one array of the SoA) from host to GPU, decreasing total latency (upload + compute + download) further.
You can also try the asynchronous local<-->global transfer functions, such as async_work_group_copy, to overlap loading and computing for each work group and hide even more latency as a last resort: https://www.khronos.org/registry/OpenCL/sdk/1.0/docs/man/xhtml/async_work_group_copy.html
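Inside the kernel that could look like this minimal sketch (assuming cache can hold in_layer_s floats; all work items of the group must reach both calls):

```c
// Start copying the shared input array into local memory in the background.
event_t e = async_work_group_copy(cache, input, (size_t)in_layer_s, 0);
// ... independent work that does not touch cache could go here ...
wait_group_events(1, &e);  // block until the copy has finished
```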