I have a matrix M of floats, roughly 17000 by 10000. I need the scalar (dot) product of every row with every row (so a 17000 by 17000 result), which can equivalently be formalized as multiplying M by its transpose.
I am new to CUDA, so I could write a "naive" solution with one thread per output element, but that is probably suboptimal speed-wise.
Alternatively, I could use something like cublasSgemm(...) with M and the transposed M as arguments, but transposing is an extra operation that should not be necessary, and the additional memory usage is also considerable (I only have a 4 GB video card freely available).
Please help me find an optimal (or at least better) solution.
If it's important, I do know the number of columns beforehand (literally using #define numCol 10001
), but the number of rows can vary as the rows are parsed from multiple .csv files.
What you describe is a symmetric rank-k update. There is a family of BLAS functions specifically for this, e.g. cublasSsyrk for single-precision floats.
https://docs.nvidia.com/cuda/cublas/index.html#cublas-lt-t-gt-syrk
Note that these functions only write the lower or upper triangular part of the result, since the other half is redundant by symmetry.
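A minimal sketch of the call (not compiled or run here; it needs a CUDA-capable GPU, and the wrapper name `gram_lower` and the assumption of row-major storage of M on the host side are mine). cuBLAS is column-major, so a row-major numRows x numCol matrix M is what cuBLAS sees as a numCol x numRows matrix A = M^T; C = M * M^T then becomes C = A^T * A, i.e. syrk with trans = CUBLAS_OP_T:

```cpp
#include <cublas_v2.h>
#include <cuda_runtime.h>

// d_M: device pointer to M, numRows x numCol, stored row-major.
// d_C: device pointer to the numRows x numRows result.
void gram_lower(cublasHandle_t handle, const float *d_M, float *d_C,
                int numRows, int numCol) {
    const float alpha = 1.0f, beta = 0.0f;
    // C = alpha * A^T * A + beta * C, with A the column-major view of d_M.
    cublasSsyrk(handle,
                CUBLAS_FILL_MODE_LOWER,  // which triangle of C to write
                CUBLAS_OP_T,             // use A^T * A
                numRows,                 // n: order of C
                numCol,                  // k: inner dimension
                &alpha,
                d_M, numCol,             // lda = numCol (leading dim of the column-major view)
                &beta,
                d_C, numRows);           // ldc = numRows
}
```

Two caveats: "lower" here is in cuBLAS's column-major sense, which is the upper triangle if you reinterpret d_C as row-major on the host; and memory-wise this should fit your card, since the output is roughly 17000 * 17000 * 4 bytes ≈ 1.16 GB plus about 0.68 GB for the input, well under 4 GB.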