
How to use the Numba CUDA JIT decorator?


I've followed this tutorial to use the Numba CUDA JIT decorator: https://www.youtube.com/watch?v=-lcWV4wkHsk&t=510s.

Here is my Python code:

import numpy as np
from timeit import default_timer as timer
from numba import cuda, jit

# This function will run on a CPU
def fill_array_with_cpu(a):
    for k in range(100000000):
        a[k] += 1

# This function will run on a CPU with @jit
@jit
def fill_array_with_cpu_jit(a):
    for k in range(100000000):
        a[k] += 1

# This function will run on a GPU
@jit(target_backend='cuda')
def fill_array_with_gpu(a):
    for k in range(100000000):
        a[k] += 1

# Main
a = np.ones(100000000, dtype=np.float64)

for i in range(3):
    start = timer()
    fill_array_with_cpu(a)
    print("On a CPU:", timer() - start)

for i in range(3):
    start = timer()
    fill_array_with_cpu_jit(a)
    print("On a CPU with @jit:", timer() - start)

for i in range(3):
    start = timer()
    fill_array_with_gpu(a)
    print("On a GPU:", timer() - start)

And here is the output:

On a CPU: 24.228116830999852
On a CPU: 24.90354355699992
On a CPU: 24.277727688999903
On a CPU with @jit: 0.2590671719999591
On a CPU with @jit: 0.09131158500008496
On a CPU with @jit: 0.09054700799993043
On a GPU: 0.13547917200003212
On a GPU: 0.0922475330000907
On a GPU: 0.08995077999998102

Using the @jit decorator greatly increases the processing speed. However, it is unclear to me whether the @jit(target_backend='cuda') decorator actually makes the function run on the GPU. The processing times are similar to those of the function with @jit, so I suspect @jit(target_backend='cuda') does not use the GPU. In fact, I've tried this code on a machine with no NVIDIA GPU and got the same results, without any warning or error.

How can I make it run on my GPU? I have a GeForce GT 730M.


Solution

There is no such thing as target_backend='cuda'. All the functions in this code are executed on the CPU (hence the identical timings once the compilation time is discarded). AFAIK an option like this existed a long time ago, but it does not anymore. The benchmark in the video is not actually correct, for several reasons, and I think it should not be trusted.
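
For reference, the way Numba actually targets the GPU is with the @cuda.jit decorator, a hand-written kernel, and an explicit launch configuration. Here is a minimal sketch (the kernel name, block size and explicit copies are illustrative choices; it requires a supported NVIDIA GPU and a working CUDA driver):

import numpy as np
from numba import cuda

# Illustrative kernel: one thread per element, with a bounds check for
# threads past the end of the array
@cuda.jit
def fill_array_kernel(a):
    i = cuda.grid(1)
    if i < a.size:
        a[i] += 1

a = np.ones(100000000, dtype=np.float64)
d_a = cuda.to_device(a)                               # explicit host-to-device copy
threads_per_block = 256
blocks = (a.size + threads_per_block - 1) // threads_per_block
fill_array_kernel[blocks, threads_per_block](d_a)
a = d_a.copy_to_host()                                # explicit device-to-host copy

Note that the two explicit copies in this sketch are precisely what makes this approach slow for such a trivial operation, as explained below.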


Not only is the benchmark in the video incorrect now, it was also biased when it was made. Indeed, even if such an option existed and worked as we would like, it would not be efficient, because the target array is stored in host memory (typically RAM). The array must therefore be transferred to the GPU device memory, computed on the device, and then transferred back from the device to host memory. Such data transfers are very expensive (and cannot be faster than the host memory itself).

Moreover, the computation is so cheap that the CPU version should be memory-bound, though one core may not be enough to saturate the RAM bandwidth: one needs a parallel CPU implementation to fully saturate the RAM on most platforms. It is also fairer to compare a parallel CPU implementation with a GPU implementation, since the latter is inherently parallel. The provided benchmark is, at best, biased because of that. In the end, the GPU implementation cannot be faster, because the data transfer cannot be faster than the parallel CPU implementation: both are limited by the host RAM. In fact, the GPU implementation should be slower, because the CPU-GPU interconnect (typically PCIe) often cannot reach a throughput as high as the host RAM bandwidth.
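
As an illustration, such a parallel CPU implementation can be written with Numba's prange (a sketch; the function name is made up for this example):

import numpy as np
from numba import njit, prange

# parallel=True + prange lets Numba split the loop across all CPU cores
@njit(parallel=True)
def fill_array_parallel(a):
    for k in prange(a.size):
        a[k] += 1

a = np.ones(100000000, dtype=np.float64)
fill_array_parallel(a)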

Last but not least, the array is of type float64, and mainstream client-side Nvidia GPUs are not made for that: they are very slow at such 64-bit floating-point (FP) computations. In fact, they are so slow that a mainstream CPU can do the computation faster. For example, your GT 730M (a low-end, very old Kepler GPU) can reach 552 GFlops in 32-bit FP but only 23 GFlops in 64-bit FP. In comparison, the i5-4258U mobile CPU, released the same year, can reach 92 GFlops: 4 times more! If you want to do fast 64-bit FP computations on a GPU with CUDA, then you need a server-side Nvidia GPU that supports 64-bit FP computations natively (most do). Note that such GPUs are far more expensive, though.

Note that the first call to a Numba function includes the compilation time. This overhead must be discarded in benchmarks, by compiling the function eagerly (i.e. providing an explicit signature), caching it, or simply discarding the timing of the first call.
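
For instance, giving Numba an explicit signature makes it compile eagerly when the decorator runs, so the first timed call no longer pays the compilation cost (a sketch; the signature shown matches the float64 array used above):

from numba import jit

# Explicit signature -> eager compilation at decoration time;
# cache=True also reuses the compiled code across interpreter runs
@jit("void(float64[:])", cache=True)
def fill_array_with_cpu_jit(a):
    for k in range(a.size):
        a[k] += 1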


Put shortly, this is a bad tutorial, and your GPU certainly cannot compute this specific operation faster than your CPU. I advise you to read Numba's documentation, which is significantly more reliable and up to date. You can also read the CUDA programming manual for more information, and this Wikipedia page for information about your GPU.