Constant size of task - the same execution time on 1x and 2x CPU - OpenCL

I am having trouble understanding my results for an integral algorithm implemented in OpenCL. I have access to two Intel Xeon E5-2680 v3 CPUs, each with 12 cores.

For some reason OpenCL shows me only one device, but I can request 12 or 24 cores, so I assume it does not matter whether I "see" one or two devices as long as all 24 cores (2 CPUs) are used.

I ran the tasks with the maximum local size of 4096 and a minimum global size of 4096. At that size the execution time was the same on 1 CPU and on 2 CPUs. I then increased the global size to 2*4096, 4*4096 and 8*4096. When I reached a global size of 16*4096, 1 CPU slowed down while 2 CPUs sped up, and for every larger global size it stayed that way: 2 CPUs were 2x faster than 1 CPU.

I don't understand why the advantage of 2 CPUs over 1 CPU does not show up from the beginning. I also measured CPU power consumption, which matters to me. At the last global size with equal execution times, 8*4096, power consumption was slightly lower with 2 CPUs. As the global size grew, consumption with 2 CPUs stayed lower than with 1 CPU, I guess because execution was 2x faster, but shouldn't it be equal to or higher than with 1 CPU? Possibly relevant: I checked that both the 1-CPU and the 2-CPU runs stay at 2.5 GHz the whole time. My questions are:

Why do 1 CPU and 2 CPUs have equal execution times at smaller global sizes?
Why do 2 CPUs have lower power consumption at bigger global sizes?
Why, at the one point where global size = 8*4096 and the execution times are equal, do 2 CPUs consume slightly less power than 1 CPU?

I should add that every run was repeated 10 times, so these results are not accidental.

Here are my results:

Solution

Why do 1 CPU and 2 CPUs have equal execution times at smaller global sizes?

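Because you used 4096 as the local size. On a CPU device each compute unit is one core, and a work-group runs on a single compute unit, so a global size of 16*4096 gives 16 work-groups and uses 16 cores. Up to 8*4096 there are at most 8 work-groups, fewer than the 12 cores of one CPU, so a second CPU has nothing extra to run. Probably you also used a memory-bound kernel, or one core accesses the other CPU's cache or memory, in which case it would not matter whether it used 1 core or N cores. When you increase the global size, the other CPU's memory can be used more often and the memory access pattern becomes more symmetrical.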
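The work-group arithmetic behind this can be sketched in C. This is a simplified model, not the runtime's actual scheduler: it assumes each core runs one work-group at a time, and the helper names work_groups and waves are made up for illustration. The core counts 12 and 24 come from the question.

```c
#include <stddef.h>

/* Number of work-groups created for a 1-D NDRange.
   Assumes global_size is a multiple of local_size. */
size_t work_groups(size_t global_size, size_t local_size) {
    return global_size / local_size;
}

/* Rounds needed to run all work-groups if every core executes
   one work-group at a time (an assumption of this model). */
size_t waves(size_t groups, size_t cores) {
    return (groups + cores - 1) / cores; /* ceiling division */
}
```

Under these assumptions, 8 work-groups need one round on both 12 and 24 cores, while 16 work-groups need two rounds on 12 cores and one round on 24. That would be consistent with the 2x gap reported from 16*4096 onward.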

Why do 2 CPUs have lower power consumption at bigger global sizes?

2 CPUs have more cache in total, so they can schedule more kernels at the same time, and reusing cached data may even cost less power than accessing RAM. Getting data from RAM should consume more power than getting it from cache.

Why, at the one point where global size = 8*4096 and the execution times are equal, do 2 CPUs consume slightly less power than 1 CPU?

With 8 cores in use (8 * local size), a single CPU was probably doing all the work. Even if it was not, both CPUs could be using the same memory bank groups, with memory bandwidth as the bottleneck. Again, 2 CPUs have more cache, so there must be some data reuse that takes advantage of the bigger cache and lowers power consumption.

You should try different device fission combinations to maximize locality and data sharing between cores. Otherwise threads may be distributed randomly among CPUs, cores and hardware threads. Device fission solves this problem and gives you more control over thread scheduling.