Tags: python, cuda, pycuda, gpu-shared-memory

Understanding in detail the algorithm for inverting a large number of 3x3 matrices


I am following up on this original post: PyCuda code to invert a high number of 3x3 matrixes. The code suggested as an answer there is:

$ cat t14.py
import numpy as np
import pycuda.driver as cuda
from pycuda.compiler import SourceModule
import pycuda.autoinit
# kernel
kernel = SourceModule("""

__device__ unsigned getoff(unsigned &off){
  unsigned ret = off & 0x0F;
  off >>= 4;
  return ret;
}   

// in-place is acceptable (i.e. out == in)
// T = float or double only
const int block_size = 288;
typedef double T; // *** can set to float or double
__global__ void inv3x3(const T * __restrict__ in, T * __restrict__ out, const size_t n, const unsigned * __restrict__ pat){

  __shared__ T si[block_size];
  size_t idx = threadIdx.x+blockDim.x*blockIdx.x;
  T det = 1;
  if (idx < n*9)
    det = in[idx];
  unsigned sibase = (threadIdx.x / 9)*9;
  unsigned lane = threadIdx.x - sibase; // cheaper modulo
  si[threadIdx.x] = det;
  __syncthreads();
  unsigned off = pat[lane];
  T a  = si[sibase + getoff(off)];
  a   *= si[sibase + getoff(off)];
  T b  = si[sibase + getoff(off)];
  b   *= si[sibase + getoff(off)];
  a -= b;
  __syncthreads();
  if (lane == 0) si[sibase+3] = a;
  if (lane == 3) si[sibase+4] = a;
  if (lane == 6) si[sibase+5] = a;
  __syncthreads();
  det =  si[sibase]*si[sibase+3]+si[sibase+1]*si[sibase+4]+si[sibase+2]*si[sibase+5];
  if (idx < n*9)
    out[idx] = a / det;
}   

""")
# host code
def gpuinv3x3(inp, n):
    # internal constants not to be modified
    hpat = (0x07584, 0x08172, 0x04251, 0x08365, 0x06280, 0x05032, 0x06473, 0x07061, 0x03140)
    # Convert parameters into numpy array
    # *** change next line between float32 and float64 to match float or double
    inpd = np.array(inp, dtype=np.float64)
    hpatd = np.array(hpat, dtype=np.uint32)
    # *** change next line between float32 and float64 to match float or double
    output = np.empty((n*9), dtype= np.float64)
    # Get kernel function
    matinv3x3 = kernel.get_function("inv3x3")
    # Define block, grid and compute
    blockDim = (288,1,1) # do not change
    gridDim = ((n//32)+1,1,1) # integer division: grid dimensions must be ints (n/32 is a float in Python 3)
    # Kernel function
    matinv3x3 (
        cuda.In(inpd), cuda.Out(output), np.uint64(n), cuda.In(hpatd),
        block=blockDim, grid=gridDim)
    return output
inp = (1.0, 1.0, 1.0, 0.0, 0.0, 3.0, 1.0, 2.0, 2.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0)
n = 2
result = gpuinv3x3(inp, n)
print(result.reshape(2,3,3))

On an initial 1D array containing 18 values (i.e. two 3x3 matrices), the result gives the correctly inverted matrices, i.e.:

[[[ 2.         -0.         -1.        ]
  [-1.         -0.33333333  1.        ]
  [-0.          0.33333333 -0.        ]]

 [[ 1.          0.          0.        ]
  [ 0.          1.          0.        ]
  [ 0.          0.          1.        ]]]
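
For reference, the same inverses can be reproduced on the CPU with NumPy, which confirms this output:

  import numpy as np
  inp = np.array([1.0, 1.0, 1.0, 0.0, 0.0, 3.0, 1.0, 2.0, 2.0,
                  1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0])
  # np.linalg.inv inverts each 3x3 matrix of the stacked array independently
  print(np.linalg.inv(inp.reshape(2, 3, 3)))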

Main issue: I would like to understand in detail how this algorithm works, especially how the kernel uses shared memory for the initial 1D vector and how that brings an optimization when I execute this code on a large number of 3x3 matrices.

  1. I understand the line size_t idx = threadIdx.x+blockDim.x*blockIdx.x;, which gives the global index of the current work-item, identified by the local threadIdx and the blockIdx of the current work-group block.

  2. I understand that __shared__ T si[block_size]; represents a shared array, i.e. one associated with a work-group block: this is what we call Local Memory in OpenCL.

  3. On the other hand, I don't understand the following part of the kernel code:

     __shared__ T si[block_size];
    
     size_t idx = threadIdx.x+blockDim.x*blockIdx.x;
     T det = 1;
     if (idx < n*9)
       det = in[idx];
     unsigned sibase = (threadIdx.x / 9)*9;
     unsigned lane = threadIdx.x - sibase; // cheaper modulo
     si[threadIdx.x] = det;
     __syncthreads();
     unsigned off = pat[lane];
     T a  = si[sibase + getoff(off)];
     a   *= si[sibase + getoff(off)];
     T b  = si[sibase + getoff(off)];
     b   *= si[sibase + getoff(off)];
     a -= b;
     __syncthreads();
     if (lane == 0) si[sibase+3] = a;
     if (lane == 3) si[sibase+4] = a;
     if (lane == 6) si[sibase+5] = a;
     __syncthreads();
    

Indeed, what is the role of the sibase index defined by unsigned sibase = (threadIdx.x / 9)*9;?

And what is the utility of the lane parameter defined by unsigned lane = threadIdx.x - sibase; // cheaper modulo?

Finally, shifts are applied with:

      T a  = si[sibase + getoff(off)];
      a   *= si[sibase + getoff(off)];
      T b  = si[sibase + getoff(off)];
      b   *= si[sibase + getoff(off)];
      a -= b;

But I don't clearly see what this accomplishes.

  4. I have the same problem with this part:

     if (lane == 0) si[sibase+3] = a;
     if (lane == 3) si[sibase+4] = a;
     if (lane == 6) si[sibase+5] = a;
    
  5. The determinant is calculated in a weird way that I can't grasp, i.e.:

     det =  si[sibase]*si[sibase+3]+si[sibase+1]*si[sibase+4]+si[sibase+2]*si[sibase+5];
    

I am not a beginner in OpenCL, but I am not expert enough to fully understand this kernel code.


Solution

  • Preliminaries

    First, it's important to understand the arithmetic of a 3x3 matrix inversion; see here (and below).

    The general methodology used for kernel design is to assign one matrix result element per thread. Therefore I will need 9 threads per matrix. Ultimately each thread will be responsible for computing one of the 9 numerical results, for each matrix. In order to compute two matrices, we then need 18 threads, 3 matrices require 27 threads.

    An ancillary task is to decide threadblock/grid sizing. This follows typical methods (overall problem size determines total number of threads needed), but we will make a specific choice of 288 for threadblock size, as this is a convenient multiple of both 9 (number of threads per matrix) and 32 (number of threads per warp in CUDA), which gives us a certain measure of efficiency (no wasted threads, no gaps in data storage).
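
    To make that sizing concrete, here is a minimal sketch of the arithmetic (the variable names are just illustrative):

      threads_per_matrix = 9
      block_size = 288                                        # 288 = 9 * 32: a multiple of 9 and of the warp size
      matrices_per_block = block_size // threads_per_matrix   # 32 complete matrices per block
      n = 100                                                 # example matrix count
      grid_x = n // matrices_per_block + 1                    # matches gridDim = ((n//32)+1,1,1) in the host code
      print(matrices_per_block, grid_x)                       # 32 4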

    Since our thread strategy is one thread per matrix element, we must collectively solve the matrix inversion arithmetic using 9 threads. The major tasks are: compute the transposed matrix of cofactors, then compute the determinant, then do the final arithmetic (divide by the determinant) to produce each result element.

    Computation of the cofactors

    The first task is to compute the transposed matrix of cofactors of A, called M:

            |a b c|
    let A = |d e f|
            |g h i|
    
        
            |ei-fh ch-bi bf-ce|
        M = |fg-di ai-cg cd-af|
            |dh-eg bg-ah ae-bd|
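
    A quick way to verify these formulas is a minimal NumPy sketch (adjugate3x3 is just an illustrative helper name): dividing M by the determinant of A reproduces the inverse.

      import numpy as np

      def adjugate3x3(A):
          # transposed matrix of cofactors M, written out exactly as above
          a, b, c, d, e, f, g, h, i = A.ravel()
          return np.array([[e*i - f*h, c*h - b*i, b*f - c*e],
                           [f*g - d*i, a*i - c*g, c*d - a*f],
                           [d*h - e*g, b*g - a*h, a*e - b*d]])

      A = np.array([[1., 1., 1.], [0., 0., 3.], [1., 2., 2.]])    # first test matrix
      M = adjugate3x3(A)
      print(np.allclose(M / np.linalg.det(A), np.linalg.inv(A)))  # True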
    

    We have 9 threads for this task, and nine elements of matrix M to compute, so we will assign one thread to each element of M. Each element of M depends on multiple input values (a, b, c, etc.) so we shall first load each input value (there are 9, one per thread), into shared memory:

      // allocate enough shared memory for one element per thread in the block:
      __shared__ T si[block_size];
      // compute a globally unique thread index, so each thread has a unique number 0,1,2,etc.
      size_t idx = threadIdx.x+blockDim.x*blockIdx.x;
  // establish a temporary variable that we will use and reuse during thread processing
      T det = 1;
      // do a thread check to make sure that our next load will be in-bounds for the input array in
      if (idx < n*9)
      // load one element per thread, 9 threads per matrix will load an entire matrix
        det = in[idx];
      // for a given matrix (9 threads) compute the base offset into shared memory, where this matrix data (9 elements) will be stored.  All 9 threads have the same base offset
      unsigned sibase = (threadIdx.x / 9)*9;
      // for each group of 9 threads handling a matrix, compute for each thread in that group, a group offset or "lane" from 0..8, so each thread in the group has a unique identifier/assignment in the group
      unsigned lane = threadIdx.x - sibase; // cheaper modulo
      // let each thread place its matrix element a,b,c, etc. into shared memory
      si[threadIdx.x] = det;
      // shared memory is now loaded, make sure all threads have loaded before any calculations begin
      __syncthreads();
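
    The sibase/lane arithmetic is worth seeing numerically; here is a minimal sketch of how threads map onto matrices within a block:

      for t in range(20):                # threadIdx.x values 0..19
          sibase = (t // 9) * 9          # base offset of this thread's matrix in shared memory
          lane = t - sibase              # this thread's position 0..8 within its matrix
          print(t, sibase, lane)
      # threads 0-8  -> sibase 0,  lane 0..8   (first matrix in the block)
      # threads 9-17 -> sibase 9,  lane 0..8   (second matrix)
      # thread 18    -> sibase 18, lane 0      (third matrix), and so on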
    

    Now that each A matrix element (a, b, c, ...) is loaded into shared memory, we can start to compute the cofactors in M. Let's focus on a particular thread (0) and its cofactor (ei-fh). All of the matrix elements needed to compute this cofactor (e, i, f, and h) are now in shared memory. We need a method to load them in sequence and perform the needed multiplications and subtraction.

    At this point we observe two things:

    1. each M element (cofactor) has a different set of 4 needed elements of A
    2. each M element (cofactor) follows the same general arithmetic. Given four arbitrary elements of A, let's refer to them generically as X, Y, Z and W; the arithmetic is XY-ZW. Take the first element, multiply it by the second, then take the third and fourth elements and multiply them together, and subtract the two products.

    Since the general sequence of operations (2, above) is the same for all 9 cofactors, we only need a method to arrange the loading of the 4 needed matrix elements. This methodology is encoded into the load patterns that are hard-coded into the example:

     hpat = (0x07584, 0x08172, 0x04251, 0x08365, 0x06280, 0x05032, 0x06473, 0x07061, 0x03140)
    

    There are 9 load patterns, each occupying a hexadecimal quantity, one load pattern per thread, i.e. one load pattern per M matrix element (cofactor). Within a particular A matrix, the matrix elements a, b, c etc. are (already) loaded into shared memory at group offsets of 0, 1, 2, etc. The load pattern for a given thread will allow us to generate the sequence of group offsets, needed to retrieve the matrix elements of A from their locations in shared memory, to be used in sequence to compute the cofactor assigned to that thread. Considering thread 0, and its cofactor ei-fh, how does the load pattern 0x7584 encode the needed pattern to select e, then i, then f, then h?

    For this we have a helper function getoff which takes a load pattern, and successively (each time it is called) strips off an index. The first time I call getoff with an argument of 0x7584, it "strips off" the index 4, returns that, and replaces the 0x7584 load pattern with 0x758 for the next usage. 4 corresponds to e. The next time I call getoff with 0x758 it "strips off" the index 8, returns that, and replaces 0x758 with 0x75. 8 corresponds to i. The next time produces the index 5, corresponding to f, and the last time produces the index 7, corresponding to h.
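
    The same unpacking is easy to reproduce on the host; this minimal sketch decodes all nine patterns and prints the XY-ZW formula each thread computes, reproducing the matrix M above:

      hpat = (0x07584, 0x08172, 0x04251, 0x08365, 0x06280, 0x05032,
              0x06473, 0x07061, 0x03140)
      labels = "abcdefghi"                 # group offsets 0..8 hold a..i
      for lane, off in enumerate(hpat):
          idx = []
          for _ in range(4):               # strip four 4-bit indices, low nibble first
              idx.append(off & 0xF)
              off >>= 4
          X, Y, Z, W = (labels[k] for k in idx)
          print(f"thread {lane}: {X}{Y}-{Z}{W}")
      # thread 0: ei-fh, thread 1: ch-bi, thread 2: bf-ce, thread 3: fg-di, ...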

    With that description then we will walk through the code, pretending we are thread 0, and describe the process of computing ei-fh:

      // get the load pattern for my matrix "lane"
      unsigned off = pat[lane];
      //load my temporary variable `a` with the first item indexed in the load pattern:
      T a  = si[sibase + getoff(off)];
      // multiply my temporary variable `a` with the second item indexed in the load pattern
      a   *= si[sibase + getoff(off)];
      //load my temporary variable `b` with the third item indexed in the load pattern
      T b  = si[sibase + getoff(off)];
      // multiply my temporary variable `b` with the fourth item indexed in the load pattern
      b   *= si[sibase + getoff(off)];
      // compute the cofactor by subtracting the 2 products
      a -= b;
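
    Tracing this with concrete numbers from the first test matrix: for thread 0 we have e=0, i=2, f=3, h=2, so a = 0*2 = 0, b = 3*2 = 6, and a -= b leaves -6. Dividing by the determinant (-3, computed below) gives 2.0, the first entry of the printed result.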
    

    sibase, as already indicated in the first commented code section, is the base offset in shared memory where that A matrix's elements are stored. The index returned by getoff is then added to this base offset to select the relevant input element.

    Computation of the determinant

    The numerical value of the determinant is given by:

    det(A) = det = a(ei-fh) - b(di-fg) + c(dh-eg)
    

    If we decompose this, we see that all terms are actually already computed:

    a,b,c:  these are input matrix elements, in shared locations (group offsets) 0, 1, 2
    ei-fh:  cofactor computed by thread 0
    di-fg:  cofactor computed by thread 3 (with sign reversed)
    dh-eg:  cofactor computed by thread 6 
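
    As a numeric check, a minimal sketch confirms this decomposition on the first test matrix:

      import numpy as np
      a, b, c, d, e, f, g, h, i = 1., 1., 1., 0., 0., 3., 1., 2., 2.
      # first-row cofactor expansion; thread 3's term is stored sign-reversed as fg-di
      det = a*(e*i - f*h) + b*(f*g - d*i) + c*(d*h - e*g)
      print(det, np.linalg.det(np.array([[a, b, c], [d, e, f], [g, h, i]])))  # -3.0 -3.0 (up to rounding)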
    

    Now, every thread will need the value of the determinant because it will be used by each thread during computation of its final (result) element. Therefore we will have every thread in the matrix redundantly compute the same value (which is more efficient than computing it, say, in one thread, then broadcasting that value to the other threads). In order to facilitate this, we will need 3 of the already computed cofactors made available to all 9 threads. So we will select 3 (no longer needed) locations in shared memory to "publish" these values. We still need the values in locations 0, 1, 2 because we need the input matrix elements a, b, and c for calculation of the determinant. But we no longer need the input elements in locations 3, 4, or 5 for the remainder of our work, so we will reuse those:

      // we are about to change shared values, so wait until all previous usage is complete
      __syncthreads();
      // load cofactor computed by thread 0 into group offset 3 in shared
      if (lane == 0) si[sibase+3] = a;
      // load cofactor computed by thread 3 into group offset 4 in shared
      if (lane == 3) si[sibase+4] = a;
      // load cofactor computed by thread 6 into group offset 5 in shared
      if (lane == 6) si[sibase+5] = a;
      // make sure shared memory loads are complete
      __syncthreads();
      // let every thread compute the determinant (same for all threads)
      //       a       * (ei-fh)    +  b         * -(di-fg)   +  c         * (dh-eg)
      det =  si[sibase]*si[sibase+3]+si[sibase+1]*si[sibase+4]+si[sibase+2]*si[sibase+5];
    

    Calculation of final result

    This only involves (for each thread) dividing the previously computed cofactor for that thread, by the just-computed determinant, and storing that result:

      // another thread check: make sure this thread is actually doing useful work
      if (idx < n*9)
      // take previously computed cofactor, divide by determinant, store result
        out[idx] = a / det;
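
    Finally, a minimal sketch to validate the whole pipeline at scale (it assumes the host code above has been run, so gpuinv3x3 is in scope):

      import numpy as np
      n = 1000
      A = np.random.rand(n, 3, 3) + 3.0*np.eye(3)   # the added diagonal keeps every matrix invertible
      result = gpuinv3x3(A.ravel(), n).reshape(n, 3, 3)
      print(np.allclose(result, np.linalg.inv(A)))  # expected: True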