
Desired Compute-To-Memory-Ratio (OP/B) on GPU


I am trying to understand the architecture of GPUs and how we assess the performance of our programs on the GPU. I know that an application can be:

  • Compute-bound: performance limited by the FLOPS rate. The processor’s cores are fully utilized (always have work to do)
  • Memory-bound: performance limited by the memory bandwidth. The processor’s cores are frequently idle because memory cannot supply data fast enough

The image below shows the FLOPS rate, peak memory bandwidth, and the desired compute-to-memory ratio, labeled (OP/B), for each microarchitecture.

(image: table of peak FLOPS rate, peak memory bandwidth, and desired OP/B per GPU microarchitecture)

I also have an example of how to compute this OP/B metric. Below is part of a CUDA kernel that applies matrix-matrix multiplication:

for(unsigned int i = 0; i < N; ++i) {
  sum += A[row*N + i]*B[i*N + col];
}

and the way to calculate OP/B for this matrix-matrix multiplication is as follows:

  • Matrix multiplication performs 0.25 OP/B
  • 1 FP add and 1 FP mul for every 2 FP values (8B) loaded
  • Ignoring stores

and if we want to utilize this:

  • But matrix multiplication has high potential for reuse. For NxN matrices:
    • Data loaded: (2 input matrices)×(N^2 values)×(4 B) = 8N^2 B
    • Operations: (N^2 dot products)(N adds + N muls each) = 2N^3 OP
    • Potential compute-to-memory ratio: 0.25N OP/B

So if I understand this clearly well, I have the following questions:

  • Is it always the case that the greater the OP/B, the better?
  • How do we know how many FP operations we have? Is it the adds and the multiplications?
  • How do we know how many bytes are loaded per FP operation?

Solution

  • Is it always the case that the greater the OP/B, the better?

    Not always. The target value balances the load on compute pipe throughput and memory pipe throughput (i.e. at that OP/B level, both pipes are fully loaded). As you increase OP/B beyond that level, your code switches from balanced to compute-bound. Once your code is compute-bound, performance is dictated by the compute pipe, which is the limiting factor. Additional OP/B increases beyond this point may have no effect on code performance.

  • How do we know how many FP operations we have? Is it the adds and the multiplications?

    Yes, for the simple code you have shown, it is the adds and multiplies. More complicated codes may have other contributors (e.g. sin, cos, etc.) which must also be counted.

    As an alternative to "manually counting" the FP operations, the GPU profilers can indicate the number of FP ops that a code has executed.

  • How do we know how many bytes are loaded per FP operation?

    Similar to the previous question, for simple codes you can "manually count". For complex codes you may wish to try to use profiler capabilities to estimate. For the code you have shown:

    sum += A[row*N + i]*B[i*N + col];
    

    The values from A and B have to be loaded. If they are float quantities, they are 4 bytes each, for a total of 8 bytes. That line of code requires one floating-point multiplication (A * B) and one floating-point addition (sum +=). The compiler will fuse these into a single fused multiply-add (FMA) instruction, but the net effect is that you are performing two floating-point operations per 8 bytes loaded: OP/B = 2/8 = 0.25. The loop does not change the ratio in this case. To increase this number, you would want to explore various optimization methods, such as a tiled shared-memory matrix multiply, or just use CUBLAS.

    (Operations like row*N + i are integer arithmetic and don't contribute to the floating-point count, although it's possible they are significant, performance-wise.)