I have installed two GPUs (2x Nvidia Quadro 410) in my system in different pci slots. To solve Martix multiplication on both of these GPU, how can I split the input matrices such that each GPU processes/computes a part of output matrix and then returns it back. For eg. for two matrix A, B each of order 10x10 , then the to compute the output matrix C= A x B ,such that ,out of 100 elements(10 x 10) 50 elements should be calculated on 1st GPU and other half i.e 50 to b computed in 2nd GPU. I am trying to implement it on OpenCL. But, any algorithm is welcomed which will help me come up with the solution.
In general, if you have matrices X
(of size a
xb
, rows first) and Y (of size b
xc
),
X * Y = vcat(X[0:a/2,0:b] * Y, X[a/2:a,0:b] * Y)
In this pseudocode, vcat
is vertical concatenation (putting one matrix on top of each other, e.g. a 4x3 matrix concatenated with 2x3 matrix will produce a 6x3 matrix), :
denotes ranges and []
is indexing.
Both arguments to vcat
can be computed on different GPUs, and the concatenation can be achieved just by pointing the output to different sub-regions of the output buffer (assuming we have C-ordered arrays). The initial splitting of X
can be similarly achieved just by using different sub-regions (since it is split along a row).