Tags: cuda, matrix-multiplication, cublas

How to optimize multiplication of a matrix by its own transpose using CUDA?


I have a matrix (M) of floats, roughly 17000 by 10000 values. I need the scalar (dot) product of every row with every other row (so a 17000 by 17000 result), which can alternatively be formalized as multiplying M by its transpose.
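
In other words, the desired result is C = M·Mᵀ, where C[i][j] = Σₖ M[i][k]·M[j][k] is the dot product of rows i and j.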

I am new to CUDA, so I could write a "naive" solution using one thread per output element, but it's probably suboptimal speed-wise.
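
A minimal sketch of that naive approach, for a row-major rows × cols matrix (the kernel name and indexing scheme are illustrative, not fixed):

    // One thread per output element: C[i][j] = dot(row i, row j) of M.
    __global__ void rowDotKernel(const float *M, float *C, int rows, int cols)
    {
        int i = blockIdx.y * blockDim.y + threadIdx.y; // output row
        int j = blockIdx.x * blockDim.x + threadIdx.x; // output column
        if (i < rows && j < rows) {
            float sum = 0.0f;
            for (int k = 0; k < cols; ++k)
                sum += M[(size_t)i * cols + k] * M[(size_t)j * cols + k];
            // size_t cast: a 17000 x 17000 result is ~289M elements
            C[(size_t)i * rows + j] = sum;
        }
    }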

Alternatively, I could use something like cublasSgemm(...) with M and the transposed M as arguments, but the explicit transpose is an extra operation that shouldn't be necessary, and the additional memory usage is considerable (I only have a 4 GB video card freely available).

Please help me find an optimal (or at least better) solution.

If it's important: I do know the number of columns beforehand (literally, #define numCol 10001), but the number of rows can vary, as the rows are parsed from multiple .csv files.


Solution

  • What you describe is a symmetric rank-k update. There is a family of BLAS functions specifically for this, e.g. cublasSsyrk for single-precision floats: https://docs.nvidia.com/cuda/cublas/index.html#cublas-lt-t-gt-syrk

    Note that these routines fill only the lower or upper triangle of the result, since the other half is redundant by symmetry; a minimal call sketch follows.
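
    Here is a hedged sketch of how such a call might look for the row-major layout in the question (handle creation, device allocation, and error handling are abbreviated; the names d_M and d_C are illustrative device pointers):

        #include <cublas_v2.h>
        #include <cstdio>

        #define numCol 10001  // from the question; assumed known at compile time

        // Computes C = M * M^T for a row-major rows x numCol matrix d_M,
        // filling one triangle of the symmetric rows x rows result d_C.
        void gramMatrix(cublasHandle_t handle, const float *d_M, float *d_C, int rows)
        {
            const float alpha = 1.0f, beta = 0.0f;
            // cuBLAS assumes column-major storage, so the row-major buffer d_M
            // is read as the numCol x rows matrix M^T with lda = numCol. With
            // trans = CUBLAS_OP_T, op(A) = (M^T)^T = M, and SYRK computes
            // C = alpha * M * M^T + beta * C with n = rows, k = numCol.
            cublasStatus_t st = cublasSsyrk(handle,
                                            CUBLAS_FILL_MODE_LOWER, // triangle of C to fill
                                            CUBLAS_OP_T,
                                            rows,         // n: order of C
                                            numCol,       // k: inner dimension
                                            &alpha,
                                            d_M, numCol,  // A and its leading dimension
                                            &beta,
                                            d_C, rows);   // C and its leading dimension
            if (st != CUBLAS_STATUS_SUCCESS)
                fprintf(stderr, "cublasSsyrk failed: %d\n", st);
        }

    Because the transpose flag handles the layout, no transposed copy of M is ever materialized, which also helps with the 4 GB memory budget. One detail to watch: CUBLAS_FILL_MODE_LOWER refers to the column-major view, so when the result buffer is read back row-major, it is the upper triangle that is populated.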