matrix opencl matrix-multiplication hpc multi-gpu

Parallel Matrix Multiplication using multiple GPUs

I have installed two GPUs (2x Nvidia Quadro 410) in my system in different PCI slots. To perform matrix multiplication on both of these GPUs, how can I split the input matrices so that each GPU computes a part of the output matrix and then returns it? For example, for two matrices A and B, each of order 10x10, I want to compute the output matrix C = A x B such that, of the 100 elements (10 x 10), 50 are calculated on the first GPU and the other 50 on the second GPU. I am trying to implement this in OpenCL, but any algorithm that helps me arrive at a solution is welcome.

Solution

In general, if you have matrices X (of size a x b, rows first, i.e. a rows and b columns) and Y (of size b x c), then

X * Y = vcat(X[0:a/2,0:b] * Y, X[a/2:a,0:b] * Y)

In this pseudocode, vcat is vertical concatenation (putting one matrix on top of another, e.g. a 4x3 matrix concatenated with a 2x3 matrix produces a 6x3 matrix), : denotes ranges, and [] is indexing. The ranges are half-open, so 0:a/2 covers rows 0 through a/2 - 1 and a/2:a covers the remaining rows.
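As a quick check of the identity for the 10x10 case from the question, here is a minimal CPU-only sketch in plain Python. The helper names (matmul, vcat, split_matmul) and the test matrices are made up for illustration; the two calls to matmul stand in for the work that each GPU would do, so each "GPU" produces 5 rows, i.e. 50 of the 100 output elements.

```python
def matmul(X, Y):
    """Plain triple-loop product of two matrices stored as lists of rows."""
    inner = len(Y)
    cols = len(Y[0])
    return [[sum(row[k] * Y[k][j] for k in range(inner)) for j in range(cols)]
            for row in X]


def vcat(top, bottom):
    """Vertical concatenation: the rows of `top` followed by the rows of `bottom`."""
    return top + bottom


def split_matmul(X, Y):
    """Compute X * Y as two independent row-block products."""
    half = len(X) // 2
    top = matmul(X[:half], Y)     # would run on the first GPU
    bottom = matmul(X[half:], Y)  # would run on the second GPU
    return vcat(top, bottom)


# Arbitrary 10x10 inputs, chosen only so the check is not trivial.
A = [[i + j for j in range(10)] for i in range(10)]
B = [[(i * j) % 7 for j in range(10)] for i in range(10)]

C = split_matmul(A, B)
print(C == matmul(A, B))
```

If the number of rows is odd, the integer division simply gives the second block one extra row.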

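The two arguments to vcat are independent of each other, so each can be computed on a different GPU. The concatenation can be achieved just by pointing each output at a different sub-region of the output buffer (assuming C-ordered, i.e. row-major, arrays). The initial splitting of X can be achieved the same way, by using different sub-regions of its buffer, since it is split along rows and each block of rows is contiguous in memory. Note that each GPU needs all of Y, but only its own half of X.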
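The sub-region idea can be sketched with flat row-major lists, which behave like the linear buffers you would hand to OpenCL. This is a CPU-only illustration, not OpenCL host code: matmul_rows is a hypothetical stand-in for the kernel launched on one device, and two_device_matmul is a hypothetical driver. With real GPUs the two calls would be issued to separate devices and each device would get its own copy of Y.

```python
def matmul_rows(x, y, out, row_start, row_end, b, c):
    """Compute rows [row_start, row_end) of out = x * y.

    x is a x b, y is b x c and out is a x c, all stored as flat
    row-major lists. Only the slice out[row_start*c : row_end*c]
    is written, and only x[row_start*b : row_end*b] is read.
    """
    for i in range(row_start, row_end):
        for j in range(c):
            acc = 0
            for k in range(b):
                acc += x[i * b + k] * y[k * c + j]
            out[i * c + j] = acc


def two_device_matmul(x, y, a, b, c):
    """Split the rows of x in two and fill disjoint halves of one output buffer."""
    out = [0] * (a * c)
    half = a // 2
    # "Device 0": rows 0 .. half-1, writes out[0 : half*c].
    matmul_rows(x, y, out, 0, half, b, c)
    # "Device 1": rows half .. a-1, writes out[half*c : a*c].
    matmul_rows(x, y, out, half, a, b, c)
    return out


# 2x2 example: [[1, 2], [3, 4]] * [[5, 6], [7, 8]]
print(two_device_matmul([1, 2, 3, 4], [5, 6, 7, 8], 2, 2, 2))  # [19, 22, 43, 50]
```

Because the two row ranges never overlap, the two halves of the output never overlap either, so no synchronization between the devices is needed beyond waiting for both to finish.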