Tags: matlab, matrix, sparse-matrix, eigen, blas

number of operations for sparse*dense matrix multiplication


  1. How many floating-point operations does it take to multiply a CSR sparse x dense matrix, or a dense x CSR sparse matrix, using optimized sparse routines (like cuSPARSE, Eigen, or Matlab)?

  2. In the limit where the sparse matrix is fully dense (N x N), the number of operations is N^2*(2*N-1): each of the N^2 output entries takes N multiplications and N-1 additions. So why are sparse routines slower than dense routines when the sparse matrix is not sparse enough? What additional work is being done?


Solution

    1. For R += S * D the number of flops is 2*nnz(S)*ncols(D), where nnz stands for number of non-zeros: each stored entry S(i,k) contributes one multiply and one add to R(i,j) for every column j of D (the first sketch after this list verifies this count).

    2. If the sparse matrix S becomes fully dense, the number of flops is the same as in the dense case, but flop count is not the only criterion determining speed; memory access patterns usually matter more. First, with sparse storage each access to an element of S costs an additional indirection, something like value[p[k]] instead of value[i+j*N] in the dense case (the second sketch after this list puts the two inner loops side by side). Second, the dense world adds blocking algorithms to reduce cache misses, vectorization (SIMD), and careful instruction pipelining to reach the peak performance of the CPU. An efficient dense matrix product is orders of magnitude more sophisticated than the naive 3 nested loops; see for yourself Eigen's implementation. Pretty scary, isn't it?
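
To see where the 2*nnz(S)*ncols(D) count in point 1 comes from, here is a minimal sketch in plain C++ (hand-rolled CSR, column-major dense storage, no library; the matrix data is made up for illustration) that performs R += S * D with a flop counter and checks it against the formula:

```cpp
// A minimal sketch: hand-rolled CSR times a column-major dense matrix,
// with a flop counter checked against 2*nnz(S)*ncols(D).
#include <cassert>
#include <cstdio>
#include <vector>

int main() {
    // S (3x3, CSR, 4 non-zeros):  [1 0 2]
    //                             [0 3 0]
    //                             [0 0 4]
    std::vector<int>    row_ptr = {0, 2, 3, 4};  // size nrows(S)+1
    std::vector<int>    col_idx = {0, 2, 1, 2};  // column of each non-zero
    std::vector<double> val     = {1, 2, 3, 4};  // value of each non-zero
    const int M = 3;  // rows of S and of R
    const int K = 3;  // cols of S = rows of D
    const int N = 2;  // cols of D and of R
    const long long nnz = static_cast<long long>(val.size());

    std::vector<double> D = {1, 1, 1,  2, 2, 2};  // K x N, column-major
    std::vector<double> R(M * N, 0.0);            // M x N, column-major

    long long flops = 0;
    for (int i = 0; i < M; ++i)
        for (int p = row_ptr[i]; p < row_ptr[i + 1]; ++p)  // each non-zero S(i,k)
            for (int j = 0; j < N; ++j) {                  // each column of D
                R[i + j * M] += val[p] * D[col_idx[p] + j * K];
                flops += 2;                                // 1 mul + 1 add
            }

    assert(flops == 2 * nnz * N);  // the formula from point 1
    std::printf("flops = %lld = 2 * nnz * ncols(D) = 2 * %lld * %d\n",
                flops, nnz, N);
    return 0;
}
```

Note that the count is independent of where the non-zeros sit: every stored entry of S is combined exactly once with each of the ncols(D) columns of D.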
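And here is a sketch of the indirection mentioned in point 2, again plain C++ and deliberately naive rather than a tuned kernel: a dense mat-vec and a CSR mat-vec computing the same y = A*x on the same made-up 3x3 matrix as above. The dense loop computes each address arithmetically (A[i + j*N], unit stride), while the CSR loop must first load col_idx[p] before it can touch x:

```cpp
// A minimal sketch contrasting the memory-access patterns only:
// dense vs. CSR mat-vec on the same matrix.
#include <cstdio>
#include <vector>

int main() {
    const int N = 3;
    // A, column-major:  [1 0 2]
    //                   [0 3 0]
    //                   [0 0 4]
    std::vector<double> A = {1, 0, 0,  0, 3, 0,  2, 0, 4};
    std::vector<double> x = {1, 1, 1};
    std::vector<double> y_dense(N, 0.0), y_csr(N, 0.0);

    // Dense kernel: the address of A(i,j) is pure arithmetic, A[i + j*N];
    // unit-stride access the compiler can vectorize and the hardware
    // prefetcher can follow.
    for (int j = 0; j < N; ++j)
        for (int i = 0; i < N; ++i)
            y_dense[i] += A[i + j * N] * x[j];

    // CSR kernel for the same matrix: before touching x, the loop must
    // first load an index; x[col_idx[p]] is the value[p[k]]-style
    // indirection from point 2.
    std::vector<int>    row_ptr = {0, 2, 3, 4};
    std::vector<int>    col_idx = {0, 2, 1, 2};
    std::vector<double> val     = {1, 2, 3, 4};
    for (int i = 0; i < N; ++i)
        for (int p = row_ptr[i]; p < row_ptr[i + 1]; ++p)
            y_csr[i] += val[p] * x[col_idx[p]];

    for (int i = 0; i < N; ++i)
        std::printf("y_dense[%d] = %g   y_csr[%d] = %g\n",
                    i, y_dense[i], i, y_csr[i]);
    return 0;
}
```

Even this toy comparison only shows the indirection; a production dense kernel (Eigen, BLAS) additionally blocks for registers and caches and uses SIMD, which the irregular sparse access pattern largely rules out. That gap is the "additional work" behind the crossover point where dense routines win even though the flop counts match.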