vectorization opencl gpgpu amd-processor amd-gpu

Should we use vector types if we want to write optimized code once for both CPU and GPU?

As is known, for the OpenCL vector type float16:

float16 on an AMD GPU (GCN) doesn't use additional vector operations, because vector operations are already used even without vector types, via the wavefront (each thread = one SIMD lane). I.e. float16 can help only with loads/stores over a wide memory bus, for example HBM (High Bandwidth Memory): https://stackoverflow.com/a/42315728/1558037
but float16 on an AMD CPU is recommended in order to engage the CPU's SIMD lanes (because each thread = one whole CPU core, not a SIMD lane): http://developer.amd.com/tools-and-sdks/opencl-zone/opencl-resources/programming-in-opencl/image-convolution-using-opencl/image-convolution-using-opencl-a-step-by-step-tutorial-5/

As a result:

on GCN, one thread sees one SIMD element (i.e. one thread is mapped to one SIMD lane): Is there any guarantee that all of threads in WaveFront (OpenCL) always synchronized?
on a CPU, one thread is mapped to one whole CPU core (with many SIMD blocks, each with many SIMD lanes)

I.e. vector types such as float16 do not matter much for the GPU, but are of great importance for the CPU.

Should we use vector types if we want to write optimized OpenCL code once for both architectures, CPU and GPU?

Conclusion:

Vector types are not much needed on a GPU or an Intel CPU, but they are needed on an AMD CPU.

Solution

In general, if performance is what you're concerned about, it is almost always a bad idea to use the same kernel for different architectures. Pre-GCN GPUs want vectors, GCN GPUs want scalars, CPUs can handle both with Intel's driver but only if you are aware of it, and I don't know how AMD's driver behaves on a CPU. CPUs also need wider vectors than GPUs. CPUs rely on cache, while GPUs rely more on scratch memory. GPUs have insanely more registers than CPUs can even dream of...
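
If you still want to keep a single kernel source, one possible compromise is to leave the element type as a macro and pick it per device when the program is built. Below is a minimal sketch in plain C of only the host-side decision. The names kernel_src, pick_vector_width, make_build_options and global_work_size are made up for illustration, and the policy (float16 on a CPU, scalar float on a GPU) is an assumption based on the discussion here, not a measured result. The resulting string is meant to be passed as the options argument of clBuildProgram.

```c
#include <stdio.h>
#include <stddef.h>

/* Single OpenCL C source. VEC_T is NOT defined here; it is supplied
 * at build time through "-D VEC_T=...". Multiplying a floatN by a
 * scalar float is valid OpenCL C, so the body works for both cases. */
const char *const kernel_src =
    "__kernel void scale(__global VEC_T *data, const float k)\n"
    "{\n"
    "    size_t i = get_global_id(0);\n"
    "    data[i] = data[i] * k;\n"
    "}\n";

/* Hypothetical policy (an assumption, tune it by measuring):
 * wide vectors on a CPU device, plain scalars on a GPU device. */
int pick_vector_width(int is_cpu)
{
    return is_cpu ? 16 : 1;
}

/* Writes the compiler option that selects the element type.
 * width must be 1, 2, 3, 4, 8 or 16 (the widths OpenCL C defines).
 * Returns the value of snprintf. */
int make_build_options(int width, char *buf, size_t size)
{
    if (width == 1)
        return snprintf(buf, size, "-D VEC_T=float");
    return snprintf(buf, size, "-D VEC_T=float%d", width);
}

/* Each work-item handles `width` floats, so the global size shrinks.
 * Assumes n_floats is a multiple of width. */
size_t global_work_size(size_t n_floats, int width)
{
    return n_floats / (size_t)width;
}
```

For a CPU device this gives the option string "-D VEC_T=float16" and a global size of n/16; for a GPU device it gives "-D VEC_T=float" and a global size of n. This only changes the data type, so anything else that differs between the architectures (cache vs. scratch memory, register pressure) would still need its own tuning.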

On GCN, vector types actually just make my code look nicer to me, and save some typing and some mistakes. float v[4], float4 v, or even float v0, v1, v2, v3 doesn't make much difference most of the time.

And as said before, Intel's CL driver can map a thread to a SIMD element, so that one core runs 8 CL threads.