
Compilation warning OpenCL Matrix Multiplication


Why is this not vectorizing?

__attribute__((num_simd_work_items(4)))
__attribute__((num_compute_units(2)))
__attribute__((reqd_work_group_size(16,16,1)))
__kernel void matrix_multiplication( const int fDIM, const int gDIM, const int hDIM,
                                     __global float* A, __global float* B, __global float* C )
{
    int i = get_global_id(0);
    int j = get_global_id(1);

    if ( ( i < gDIM ) && ( j < fDIM ) ) {
        float temp_result = 0.0f;
        for ( int k = 0; k < hDIM; k++ ) {
            temp_result += A[i*gDIM+k] * B[k*hDIM+j];
        }
        C[i*gDIM+j] = temp_result;
    }
}

Compiler Warning:

Kernel Vectorization: branching is thread ID dependent ... cannot vectorize.


Solution

  • Q : Why is this not vectorizing?

    The culprit is the "branching … cannot vectorize" part of the warning. It relates to this instruction:

    if( ( i < gDIM ) && ( j < fDIM ) ){ ... }

    Efficient SIMD-instructions based vectorisation requires that all code-execution flows are non-"divergent" (non-branched) and execute the very same instruction on the very same layout of data: data elements get SIMD-"glued" into vectors of DATA, placed into wide-enough, SIMD-friendly CPU registers, and computed at once by a single SIMD-friendly instruction that is identical for every thread-in-a-pack. An if(){...}else{...} construct diverges the flow into different sequences of different instructions for different data elements, which breaks exactly that requirement.

    It is principally impossible to perform different operations on different parts of the data aligned into a SIMD-friendly CPU register: one and only one SIMD-friendly instruction can be executed at once for all vector components stored in that register.
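    A tiny illustration of the same principle in plain C (the function names are mine, not from the original post): the first loop contains a data-dependent branch, the second expresses the same condition as a select, which compilers can turn into a masked/blended SIMD operation instead of a jump.

    ```c
    /* Divergent form: the if() creates two different flows per element,
       which blocks straightforward auto-vectorisation. */
    float sum_positive_branchy(const float *x, int n) {
        float s = 0.0f;
        for (int i = 0; i < n; i++) {
            if (x[i] > 0.0f)          /* data-dependent branch */
                s += x[i];
        }
        return s;
    }

    /* Branch-free form: every element executes the same instructions;
       the condition becomes a blend/mask, not a divergent jump. */
    float sum_positive_branchless(const float *x, int n) {
        float s = 0.0f;
        for (int i = 0; i < n; i++) {
            s += (x[i] > 0.0f) ? x[i] : 0.0f;   /* compiles to a select */
        }
        return s;
    }
    ```

    Both functions return identical results; only the branchless one keeps all lanes on the same instruction stream.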

    Hardware details of integer and float SIMD-vector instructions vary, as do the resulting micro-op latencies; SIMD-processor-specific details of the compiler matter a lot, yet the principle of avoiding divergent paths is common to all automated SIMD-vectorisation performed in the compilation phase. For more details on SIMD instructions and their further performance-limiting properties, one may read and learn from Agner Fog's optimisation resources.
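    One common way to get rid of the thread-ID-dependent guard is to pad the problem size up to a multiple of the work-group size, so every work-item has a valid (zero-filled) element to process and the if() disappears from the kernel body. A minimal host-side sketch of that idea in plain C follows; the function names and the padding scheme are illustrative assumptions, not the original poster's code, and the inner loops stand in for the branch-free per-work-item body.

    ```c
    #include <stdlib.h>
    #include <string.h>

    /* Round n up to the next multiple of wg (the work-group size). */
    static int round_up(int n, int wg) { return ((n + wg - 1) / wg) * wg; }

    /* Multiply a rows x k matrix A by a k x cols matrix B into C,
       using zero-padded buffers so the inner "work-item" body needs
       no bounds check - every (i, j) in the padded range is valid. */
    void matmul_padded(const float *A, const float *B, float *C,
                       int rows, int k, int cols, int wg) {
        int pr = round_up(rows, wg), pc = round_up(cols, wg);
        float *Ap = calloc((size_t)pr * k,  sizeof *Ap);
        float *Bp = calloc((size_t)k  * pc, sizeof *Bp);
        float *Cp = calloc((size_t)pr * pc, sizeof *Cp);

        for (int i = 0; i < rows; i++)           /* copy A row by row   */
            memcpy(&Ap[i * k], &A[i * k], k * sizeof *A);
        for (int r = 0; r < k; r++)              /* copy B row by row   */
            memcpy(&Bp[r * pc], &B[r * cols], cols * sizeof *B);

        /* "Kernel" body, branch-free: the same instructions run for
           every (i, j); padded lanes just accumulate zeros. */
        for (int i = 0; i < pr; i++)
            for (int j = 0; j < pc; j++) {
                float t = 0.0f;
                for (int r = 0; r < k; r++)
                    t += Ap[i * k + r] * Bp[r * pc + j];
                Cp[i * pc + j] = t;
            }

        for (int i = 0; i < rows; i++)           /* copy valid part out */
            memcpy(&C[i * cols], &Cp[i * pc], cols * sizeof *C);
        free(Ap); free(Bp); free(Cp);
    }
    ```

    On the OpenCL side the equivalent move is to enqueue a global size rounded up to a multiple of the reqd_work_group_size and drop the guard, at the cost of a few wasted (but uniform, hence vectorisable) work-items.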