Assume you have a dense matrix of size 1500x500 and you need to multiply it with a block-diagonal matrix of size 500x500 that consists of ten sub-matrices of size 50x50 sitting on the diagonal:
S 0 ... 0 0
0 S ... 0 0
...
0 0 ... S 0
0 0 ... 0 S   <- each S is 50x50
Sometimes all S are equal, sometimes they're not.
I haven't profiled yet, but I suspect a straight cuBLAS multiplication would waste too much time multiplying by the zeros. Is there an efficient way to implement such a multiplication?
You may use cuSPARSE with the Block Compressed Sparse Row (BSR) format, as described in the cuSPARSE documentation. Your matrix structure might benefit from more specialized optimizations, but this one is available now.
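To make the BSR suggestion concrete, here is a minimal host-side sketch of how a block-diagonal matrix maps onto the three BSR arrays (block row pointers, block column indices, block values) with zero-based indexing. The `BsrMatrix` struct and `make_block_diagonal_bsr` helper are hypothetical names for illustration only, not part of cuSPARSE; the sketch only builds the arrays and does not call the library.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical container mirroring the three BSR arrays (not a cuSPARSE type).
struct BsrMatrix {
    int mb = 0;               // number of block rows (equals block columns here)
    int blockDim = 0;         // edge length of each square block
    std::vector<int> rowPtr;  // mb + 1 entries, zero-based
    std::vector<int> colInd;  // one entry per stored block
    std::vector<float> val;   // stored blocks, back to back
};

// Build the BSR arrays for blockdiag(blocks[0], ..., blocks[mb-1]).
// Each blocks[b] holds blockDim * blockDim values in whatever in-block
// order (row- or column-major) you later declare to the library.
BsrMatrix make_block_diagonal_bsr(const std::vector<std::vector<float>>& blocks,
                                  int blockDim) {
    BsrMatrix m;
    m.mb = static_cast<int>(blocks.size());
    m.blockDim = blockDim;
    m.rowPtr.resize(m.mb + 1);
    m.colInd.resize(m.mb);
    for (int b = 0; b < m.mb; ++b) {
        assert(blocks[b].size() == static_cast<std::size_t>(blockDim) * blockDim);
        m.rowPtr[b] = b;  // block row b holds exactly one stored block...
        m.colInd[b] = b;  // ...and it sits in block column b
        m.val.insert(m.val.end(), blocks[b].begin(), blocks[b].end());
    }
    m.rowPtr[m.mb] = m.mb;  // total number of stored blocks
    return m;
}
```

For the matrix in the question this gives 10 block rows, a block size of 50, and 10 * 50 * 50 = 25,000 stored values instead of 250,000. Note that BSR stores every block separately, so it does not by itself exploit the case where all S are equal.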
Alternatively, you may use cublas<>gemmBatched: access your dense matrix by blocks of rows or columns, and define your block-diagonal matrix as a set of smaller dense matrices (potentially reusing the same data when all S are equal).
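The batched idea rests on the fact that D * blockdiag(S_0, ..., S_9) splits into ten independent products: column block b of the result is column block b of D times S_b. Below is a CPU-only sketch of that decomposition in column-major storage, using per-block pointers in the same spirit as the pointer arrays a batched GEMM takes. The function names `gemm_colmajor` and `blockdiag_multiply` are made up for this illustration, and nothing here calls cuBLAS.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Column-major GEMM on the CPU: C (m x n) = A (m x k) * B (k x n).
// lda/ldb/ldc are leading dimensions, as in BLAS.
void gemm_colmajor(int m, int n, int k,
                   const float* A, int lda,
                   const float* B, int ldb,
                   float* C, int ldc) {
    for (int j = 0; j < n; ++j) {
        for (int i = 0; i < m; ++i) {
            float acc = 0.0f;
            for (int p = 0; p < k; ++p) {
                acc += A[i + static_cast<std::size_t>(p) * lda] *
                       B[p + static_cast<std::size_t>(j) * ldb];
            }
            C[i + static_cast<std::size_t>(j) * ldc] = acc;
        }
    }
}

// C = D * blockdiag(S[0], ..., S[nb-1]), everything column-major.
// D and C are rows x (nb * bs); each S[b] points at a bs x bs block.
// Passing the same pointer nb times covers the "all S equal" case.
void blockdiag_multiply(int rows, int bs, int nb,
                        const float* D,
                        const std::vector<const float*>& S,
                        float* C) {
    for (int b = 0; b < nb; ++b) {
        // Column block b of D (and of C) starts b * bs * rows floats in.
        const std::size_t offset = static_cast<std::size_t>(b) * bs * rows;
        // One small GEMM per diagonal block; the zeros are never touched.
        gemm_colmajor(rows, bs, bs, D + offset, rows, S[b], bs, C + offset, rows);
    }
}
```

On the GPU, the loop would correspond to a single cublasSgemmBatched call with m = rows, n = k = bs, lda = ldc = rows, ldb = bs and batchCount = nb (1500, 50 and 10 for the sizes in the question), with the three pointer arrays living in device memory. Whether this actually beats one dense GEMM at these sizes is something only profiling can tell.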