Search code examples
pythoncudacupy

Why (x / y)[i] faster than x[i] / y[i]?


I'm new to CuPy and CUDA/GPU computing. Can someone explain why (x / y)[i] faster than x[i] / y[i]?

When taking advantage of GPU accelerated computations, are there any guidelines that would allow me to quickly determine which operation is faster? Which to avoid benchmarking every operation.

# In VSCode Jupyter Notebook

import cupy as cp
from cupyx.profiler import benchmark

x = cp.arange(1_000_000)
y = (cp.arange(1_000_000) + 1) / 2
i = cp.random.randint(2, size=1_000_000) == 0
x, y, i

# Output:
(array([     0,      1,      2, ..., 999997, 999998, 999999], shape=(1000000,), dtype=int32),
 array([5.000000e-01, 1.000000e+00, 1.500000e+00, ..., 4.999990e+05,
        4.999995e+05, 5.000000e+05], shape=(1000000,), dtype=float64),
 array([ True, False,  True, ...,  True, False,  True], shape=(1000000,), dtype=bool))
def test1(x, y, i):
    return (x / y)[i]

def test2(x, y, i):
    return x[i] / y[i]

print(benchmark(test1, (x, y, i)))
print(benchmark(test2, (x, y, i)))

# Output:
test1: CPU: 175.164 us +/- 61.250 (min: 125.200 / max: 765.100) us  GPU-0: 186.001 us +/- 67.314 (min: 134.144 / max: 837.568) us
test2: CPU: 342.364 us +/- 130.840 (min: 223.000 / max: 1277.600) us  GPU-0: 368.133 us +/- 136.911 (min: 225.504 / max: 1297.408) us

Solution

  • The ratio of arithmetic capability to bytes retrieval capability on a GPU (at least, maybe also CPU) is generally lopsided in favor of arithmetic capability. Therefore algorithm performance can often be predicted by the the number of memory operations required. The first step in the x[i]/y[i] realization is to create x[i] and y[i], which is pure memory operations - stream compaction. Then the reduced level of arithmetic occurs. In the other realization, a higher level of arithmetic occurs first, followed by a 50% reduction in memory ops.

    A high end data center GPU - e.g. the A100, could have ~10 TF of FP64 throughput, and ~2 TB/s of memory bandwidth, so that ratio would be 5 FP64 FLOPs per byte of data read or written. For an 8-byte FP64 quantity, that is 40 FLOPs per FP64 quantity read or written. (Division requires multiple FLOPs.)

    Let's assume that the i array is composed of 50% True and 50% false. Now lets count all the steps required.

    In the x[i]/y[i] realization, we must

    • read the entire i array - 1,000,000 reads.
    • use that to read a portion of the x and y arrays, but on a GPU, you don't get to read individual elements. You read memory in chunks of 32 bytes at a time, so realistically with evenly distributed True/False distribution, you may end up reading the entire x and y arrays - 2,000,000 reads
    • stream compaction on x, y according to i
    • perform the division arithmetic - 500,000 division ops.
    • write results - 500,000 writes

    In the (x/y)[i] realization:

    • read the entire i array - 1,000,000 reads.
    • read the entire x and y arrays - 2,000,000 reads
    • perform the division arithmetic - 1,000,000 division ops.
    • stream compaction on division result according to i
    • write results - 500,000 writes

    The only difference is the positioning of the stream compaction operation(s) and the number of division operations required. Stream compaction is basically a purely memory bound activity, so the difference is in the cost of accessing memory. The first approach requires twice as much stream compaction.

    On a GPU, at least, flops are often significantly less "expensive" than reading/writing data.

    The percentage ratio of True/False in i doesn't really affect this analysis for A100. Other GPUs like T4 have a relatively low throughput for FP64 compared to FP32. T4 has a ratio of 320GB/s of memory bandwidth to ~8TF of FP32 thoughput, but only 1/8TF or 125GF of FP64 throughput. At high ratios of True to False, this won't matter, but at low ratios of True to False, for T4, its possible that the cost of division is a factor. This distinction would probably be wiped out by switching to FP32 datatype. Another way to say this could be that T4, for FP64 ops, doesn't necessarily satisfy the initial premise of my answer here: "The ratio of arithmetic capability to bytes retrieval capability on a GPU (at least, maybe also CPU) is generally lopsided in favor of arithmetic capability."