Tags: time, time-complexity, linear-algebra, benchmarking, blas

What is the time complexity of Trsm and other BLAS operations?


I am accelerating a model by replacing all of its linear algebra operations with cuBLAS functions, and I want to work out the time complexity or FLOP count of the model so I can evaluate its performance with a roofline model.
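For context, the roofline estimate itself is a one-liner once the FLOP and byte counts are known. A minimal sketch, where the peak and bandwidth figures are made-up placeholders rather than any particular GPU:

```python
def roofline_flops(flops, bytes_moved, peak_flops, peak_bw):
    """Attainable FLOP/s under the roofline model:
    min(peak compute, arithmetic intensity * peak bandwidth)."""
    ai = flops / bytes_moved              # arithmetic intensity (FLOP/byte)
    return min(peak_flops, ai * peak_bw)

# hypothetical device: 10 TFLOP/s peak, 500 GB/s memory bandwidth
attainable = roofline_flops(flops=2e9, bytes_moved=4e7,
                            peak_flops=10e12, peak_bw=500e9)
# ai = 50 FLOP/byte here, so this kernel lands on the compute roof
```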

There are two kinds of operations in the model: Gemm and Trsm.

I know the FLOP count of Gemm is about 2 * k * m * n from the question How to compute the achieved FLOPS of a MPI program which calls cuBlas function:

The standard BLAS GEMM operation is C <- alpha * (A dot B) + beta * C and for A (m by k), B (k by n) and C (m by n), each inner product of a row of A and a column of B multiplied by alpha is 2 * k + 1 flop and there are m * n inner products in A dot B and another 2 * m * n flop for adding beta * C to that dot product. So the total model FLOP count is (2 * k + 3) * (m * n) when alpha and beta are both non-zero.
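That breakdown is easy to encode as a small helper (the function name and flags here are my own, not part of any BLAS API):

```python
def gemm_flops(m, n, k, alpha_nonzero=True, beta_nonzero=True):
    """FLOP count for C <- alpha * (A dot B) + beta * C,
    following the per-term breakdown quoted above."""
    flops = 2 * k * m * n        # multiplies and adds in the m*n inner products
    if alpha_nonzero:
        flops += m * n           # one alpha multiply per inner product
    if beta_nonzero:
        flops += 2 * m * n       # beta * C and the final addition
    return flops

# matches the (2 * k + 3) * (m * n) total when alpha and beta are non-zero
assert gemm_flops(m=4, n=5, k=6) == (2 * 6 + 3) * (4 * 5)
```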

But for Trsm, I have no idea about its FLOP count. All the documents I found only say it is about O(n^3), which is not precise enough for a roofline estimate.

Sincerely thank you for your answers!


Solution

  • But for Trsm, I have no idea about its FLOP count. All the documents I found only say it is about O(n^3), which is not precise enough for a roofline estimate.

    It's about O(m^2 * n) flops, which is O(n^3) when the matrices are square, so the order your documents quote is right; the exact count follows.

    TRSM has a number of different variants depending on the side, uplo and diag options, but for the canonical left-side case B <- alpha * inv(A) dot B, where B is (m by n) and A is an (m by m) non-unit upper triangular matrix, the operation count should be about m^2 * n flops: each of the n columns of B is solved by one back-substitution, which costs m divisions plus roughly m * (m - 1) multiply-subtracts over the non-zero row entries, i.e. about m^2 flops per column, plus another m * n multiplications for the alpha scaling when alpha != 1.
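As a sanity check on that count, here is a minimal instrumented back-substitution in plain Python (lists rather than arrays, alpha omitted) that tallies flops as it solves; the tally comes out to exactly m^2 * n:

```python
def trsm_flop_count(A, B):
    """Naive left-side solve X = inv(A) @ B for a non-unit upper
    triangular A (m x m) and B (m x n), counting flops as it goes."""
    m, n = len(A), len(B[0])
    X = [row[:] for row in B]
    flops = 0
    for j in range(n):                      # one triangular solve per column
        for i in range(m - 1, -1, -1):      # back-substitution, bottom row up
            s = X[i][j]
            for p in range(i + 1, m):
                s -= A[i][p] * X[p][j]      # 1 multiply + 1 subtract
                flops += 2
            X[i][j] = s / A[i][i]           # 1 division (non-unit diagonal)
            flops += 1
    return X, flops                         # flops == n * m**2 exactly
```

Per column this is m divisions plus 2 * (m * (m - 1) / 2) multiply-subtracts, which sums to exactly m^2, hence n * m^2 overall.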

    If you want the exact FLOP count for the reference implementation of any of the BLAS or LAPACK routines, the obvious way to get it is to inspect the reference serial code and count the floating point operations, either by hand or by instrumenting the code itself.
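One low-effort way to do that instrumentation, sketched here with an illustrative wrapper class rather than any standard tool, is to run the reference algorithm on a numeric type that counts its own arithmetic:

```python
class CountedFloat:
    """Float wrapper that counts arithmetic operations, so any
    reference implementation run on CountedFloat inputs reports
    its FLOP count without modifying the algorithm itself."""
    ops = 0

    def __init__(self, v):
        self.v = float(v)

    def _f(self, other):
        return other.v if isinstance(other, CountedFloat) else float(other)

    def __add__(self, other):
        CountedFloat.ops += 1
        return CountedFloat(self.v + self._f(other))

    def __sub__(self, other):
        CountedFloat.ops += 1
        return CountedFloat(self.v - self._f(other))

    def __mul__(self, other):
        CountedFloat.ops += 1
        return CountedFloat(self.v * self._f(other))

    def __truediv__(self, other):
        CountedFloat.ops += 1
        return CountedFloat(self.v / self._f(other))

# example: a length-k dot product costs k multiplies + k adds here
k = 8
x = [CountedFloat(1.0)] * k
y = [CountedFloat(2.0)] * k
acc = CountedFloat(0.0)
for a, b in zip(x, y):
    acc = acc + a * b
# CountedFloat.ops is now 2 * k
```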