Tags: gpu, intel, gpgpu, hardware-acceleration, xeon-phi

Coprocessor accelerators compared to GPUs


Are coprocessors like the Intel Xeon Phi meant to be used much like GPUs, where one offloads a large number of blocks executing a single kernel so that the speedup comes from the coprocessor's overall throughput, or can offloading independent threads (tasks) also improve efficiency?


Solution

  • The Xeon Phi requires a large degree of both functional parallelism (different threads) and vector parallelism (SIMD). Since the cores are essentially enhanced Pentium cores, serial code runs slowly. This will change somewhat with the next generation, which will use faster and more modern cores. Like any coprocessor, the current Xeon Phi also suffers from an I/O bottleneck, since it has to communicate over a PCIe bus.

    So although you can offload a single kernel to the whole card and exploit the 512-bit vectorization (similar to a GPGPU), as sketched below, you can also separate your code into many different functional blocks (i.e. different codes/kernels) and run them on different sets of Xeon Phi cores. Again, each block of code must also exploit the 512-bit SIMD vectors.
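
    As a concrete illustration of the simple offload case, here is a minimal sketch using the Intel compiler's offload pragmas (LEO) together with OpenMP, assuming a compiler with OpenMP 4.0 support (e.g. icc with -qopenmp); the saxpy kernel and the array size are just placeholders:

        #include <stdio.h>
        #include <stdlib.h>

        #define N (1 << 20)

        /* Compile this function for both the host and the coprocessor. */
        __attribute__((target(mic)))
        void saxpy(float a, const float *x, float *y, int n)
        {
            /* Spread iterations across the Phi's threads and ask the
               compiler to use the 512-bit SIMD units within each thread. */
            #pragma omp parallel for simd
            for (int i = 0; i < n; i++)
                y[i] = a * x[i] + y[i];
        }

        int main(void)
        {
            float *x = malloc(N * sizeof *x);
            float *y = malloc(N * sizeof *y);
            for (int i = 0; i < N; i++) { x[i] = 1.0f; y[i] = 2.0f; }

            /* Copy x and y over PCIe, run on the first card, copy y back. */
            #pragma offload target(mic:0) in(x:length(N)) inout(y:length(N))
            saxpy(2.0f, x, y, N);

            printf("y[0] = %f\n", y[0]);   /* expect 4.0 */
            free(x);
            free(y);
            return 0;
        }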

    The Xeon Phi also operates as a native processor, so you can access other resources by mounting NFS directory trees, communicate between cards and with other processors in the cluster over TCP/IP, use MPI, and so on. Note that this is native execution, not 'offload'. But the PCIe bus is still a significant bottleneck limiting I/O.
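
    For the native route, an ordinary MPI program works unchanged; the sketch below assumes Intel MPI, where one would cross-compile for the card (e.g. with mpiicc -mmic) and launch ranks directly on the coprocessor:

        #include <mpi.h>
        #include <stdio.h>

        int main(int argc, char **argv)
        {
            int rank, size;
            MPI_Init(&argc, &argv);
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);
            MPI_Comm_size(MPI_COMM_WORLD, &size);

            /* Each rank contributes a value; ranks may live on host Xeons
               or natively on Phi cards, and MPI hides the difference,
               though traffic to and from the cards still crosses PCIe. */
            double local = (double)rank, total = 0.0;
            MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0,
                       MPI_COMM_WORLD);

            if (rank == 0)
                printf("sum over %d ranks = %f\n", size, total);

            MPI_Finalize();
            return 0;
        }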

    To summarize,

    • You can use an offload model similar to that used by GPGPUs.
    • The Xeon Phi itself can also support functional parallelism (more than one kernel), but each kernel must also exploit the 512-bit SIMD; see the sketch after this list.
    • You can also write native code and use MPI, treating the Xeon Phi as a conventional (non-offload) node (always remembering the PCIe I/O bottleneck).
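
    To illustrate the functional-parallelism point, the sketch below uses the offload model's asynchronous signal/offload_wait clauses (again assuming the Intel compiler's LEO pragmas) to keep two different kernels in flight on the card at once; kernel_a and kernel_b are hypothetical, and which cores each kernel actually occupies is governed by the offload runtime and thread-affinity settings:

        #include <stdio.h>
        #include <stdlib.h>

        #define N 1000000

        /* Two unrelated kernels ("functional blocks"); each still uses
           the 512-bit SIMD units via the simd pragma. */
        __attribute__((target(mic)))
        void kernel_a(float *a, int n)
        {
            #pragma omp parallel for simd
            for (int i = 0; i < n; i++) a[i] *= 2.0f;
        }

        __attribute__((target(mic)))
        void kernel_b(float *b, int n)
        {
            #pragma omp parallel for simd
            for (int i = 0; i < n; i++) b[i] += 1.0f;
        }

        int main(void)
        {
            float *a = malloc(N * sizeof *a);
            float *b = malloc(N * sizeof *b);
            for (int i = 0; i < N; i++) { a[i] = 1.0f; b[i] = 1.0f; }

            int sig_a, sig_b;   /* completion tags for the async offloads */

            /* The signal clause makes each offload asynchronous, so both
               kernels can run on the card at the same time. */
            #pragma offload target(mic:0) signal(&sig_a) inout(a:length(N))
            kernel_a(a, N);

            #pragma offload target(mic:0) signal(&sig_b) inout(b:length(N))
            kernel_b(b, N);

            /* The host is free to do other work here before blocking. */
            #pragma offload_wait target(mic:0) wait(&sig_a)
            #pragma offload_wait target(mic:0) wait(&sig_b)

            printf("a[0] = %f, b[0] = %f\n", a[0], b[0]);  /* 2.0 and 2.0 */
            free(a);
            free(b);
            return 0;
        }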