As far as I understand, to execute kernels concurrently (in my case the same kernel, but on different I/O data), each kernel instance has to run on its own compute unit (streaming multiprocessor, SM), apparently with its own workgroups.

For example, the GTX 960m has 5 SMs (compute units in OpenCL). If I call clEnqueueNDRangeKernel 5 times asynchronously on an out-of-order queue, each call with its own 16x16 (2D) workgroup, will all 5 compute units execute them concurrently? Also, the reported local memory is 64 KB. Is that shared by all compute units, or does each one have its own 64 KB?
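Roughly, this is the launch pattern I have in mind (the kernel and the buffer names are placeholders; I don't know whether every driver actually overlaps these launches):

```c
#include <CL/cl.h>

// Hypothetical sketch: one out-of-order queue, five independent enqueues
// of the same kernel on different I/O buffers.
void launch_five(cl_context ctx, cl_device_id dev, cl_kernel kernel,
                 cl_mem in[5], cl_mem out[5]) {
    cl_int err;
    cl_command_queue queue = clCreateCommandQueue(
        ctx, dev, CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE, &err);

    size_t global[2] = {1024, 1024}; // example NDRange size
    size_t local[2]  = {16, 16};     // 16x16 = 256 threads per workgroup

    for (int i = 0; i < 5; ++i) {
        clSetKernelArg(kernel, 0, sizeof(cl_mem), &in[i]);
        clSetKernelArg(kernel, 1, sizeof(cl_mem), &out[i]);
        // Argument values are captured at enqueue time; with no event
        // dependencies, the runtime is free to overlap these launches.
        clEnqueueNDRangeKernel(queue, kernel, 2, NULL, global, local,
                               0, NULL, NULL);
    }
    clFinish(queue); // wait for all five launches to complete
    clReleaseCommandQueue(queue);
}
```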
Each CU has either 4 (Maxwell/Pascal) or 2 (Turing/Ampere, AMD) groups of 32 CUDA cores / stream processors in hardware. Each group executes one Warp at a time, a Warp being a group of 32 threads that run in lockstep on those 32 cores.
All threads running in one Warp have to execute exactly the same instruction at the same time. If threads within a Warp take different branches, the hardware has to execute both paths one after the other with inactive threads masked out, so divergent branching within a Warp is effectively serialized. Two Warps within a CU can handle different branches, but they cannot run different kernels at the same time.
If you execute two kernels in parallel in different queues on your 960m with its 5 CUs, kernel 1 can, for example, get 3 CUs and kernel 2 the remaining 2. But a single CU cannot be split to run multiple kernels at the same time.
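A minimal host-side sketch of that two-queue setup (assuming the context, device, kernels, and a suitable 1D NDRange are already set up):

```c
#include <CL/cl.h>

// Two plain in-order queues on the same device; the driver may
// distribute the two kernels across the available CUs.
void run_two_kernels(cl_context ctx, cl_device_id dev,
                     cl_kernel k1, cl_kernel k2,
                     size_t global, size_t local) {
    cl_int err;
    cl_command_queue q1 = clCreateCommandQueue(ctx, dev, 0, &err);
    cl_command_queue q2 = clCreateCommandQueue(ctx, dev, 0, &err);

    // Each queue is in-order on its own, but the two queues are
    // independent, so the kernels can execute concurrently if the
    // hardware has free CUs.
    clEnqueueNDRangeKernel(q1, k1, 1, NULL, &global, &local, 0, NULL, NULL);
    clEnqueueNDRangeKernel(q2, k2, 1, NULL, &global, &local, 0, NULL, NULL);

    clFinish(q1);
    clFinish(q2);
    clReleaseCommandQueue(q1);
    clReleaseCommandQueue(q2);
}
```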
In OpenCL you can set the workgroup size to some multiple of the Warp size (32). On one Maxwell CU, either 4 workgroups (workgroup size 32), 2 workgroups (workgroup size 64), or 1 workgroup (workgroup size 128 or greater) can execute at any one moment.
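Rather than hard-coding 32, you can also query the Warp size and the per-kernel workgroup limit at runtime; a short sketch:

```c
#include <CL/cl.h>
#include <stdio.h>

// Query the preferred workgroup size multiple (the Warp size on NVIDIA
// GPUs) and the maximum workgroup size this kernel can use.
void print_wg_sizes(cl_kernel kernel, cl_device_id dev) {
    size_t warp = 0, max_wg = 0;
    clGetKernelWorkGroupInfo(kernel, dev,
        CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE,
        sizeof(warp), &warp, NULL);     // 32 on NVIDIA hardware
    clGetKernelWorkGroupInfo(kernel, dev,
        CL_KERNEL_WORK_GROUP_SIZE,
        sizeof(max_wg), &max_wg, NULL); // upper limit for this kernel
    printf("preferred multiple: %zu, max workgroup size: %zu\n",
           warp, max_wg);
}
```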
The amount of local memory, in your case 64 KB, is per CU. So if you have a large workgroup of, for example, 256 threads, each thread has less local memory available than with a workgroup size of 64, because all threads in the workgroup share the local memory of the one CU they run on.
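You can confirm the per-CU figure with clGetDeviceInfo, and a rough division illustrates how the per-thread share shrinks as the workgroup grows (a sketch, assuming one workgroup occupies the whole CU):

```c
#include <CL/cl.h>
#include <stdio.h>

// Query the local memory size (reported per CU) and print the
// per-thread share for two example workgroup sizes.
void print_local_mem(cl_device_id dev) {
    cl_ulong local_mem = 0;
    clGetDeviceInfo(dev, CL_DEVICE_LOCAL_MEM_SIZE,
                    sizeof(local_mem), &local_mem, NULL);
    printf("local memory per CU: %llu bytes\n",
           (unsigned long long)local_mem);
    printf("per-thread share at workgroup size 256: %llu bytes\n",
           (unsigned long long)(local_mem / 256));
    printf("per-thread share at workgroup size  64: %llu bytes\n",
           (unsigned long long)(local_mem / 64));
}
```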