Creating a progress bar in Python with Numba and CUDA


I'm running a parallel process using Numba and CUDA (on Windows) that will take quite a while. It would be nice to have an updating progress bar printed in the console so I can see how many of the threads have finished. Something like tqdm would be absolutely perfect, but for CUDA.

I've tried tqdm and numba-progress, but neither seems to work with CUDA. I've also tried my own class-based solution, but alas, you cannot pass classes into a kernel function (I think). I found this thread, which describes the same problem I want to solve, but it has no replies. All the other posts I've found were not about CUDA.

Here's some skeleton code of what I'd like to put a progress bar on:

from __future__ import print_function, absolute_import

from numba import cuda
from numba.cuda.random import create_xoroshiro128p_states, xoroshiro128p_uniform_float32
import numpy as np
from math import gamma, exp, ceil

        
# This function is just an example of what I'd like to put a progress bar on
@cuda.jit
def generate_samples(rng_states, out, rate):
    thread_id = cuda.grid(1)

    def poission_sample(rate, random_number): 
        probability_sum = 0
        index = -1
        while probability_sum < random_number:
            index += 1
            probability_sum += ((rate**index)/gamma(index+1)) * exp(-rate)
            
        return index
    
    # Ideally increment a global counter of some kind here, or have a module that does it for me
    
    out[thread_id] = poission_sample(rate, xoroshiro128p_uniform_float32(rng_states, thread_id))

number_of_samples = 10000000

threads_per_block = 512
blocks = ceil(number_of_samples/threads_per_block)
rng_states = create_xoroshiro128p_states(threads_per_block * blocks, seed=1)
out = np.zeros(threads_per_block * blocks, dtype=np.float32)

generate_samples[blocks, threads_per_block](rng_states, out, 5)
    
print('Average Sample:', out.mean())

Any help would be massively appreciated!


Solution

  • You may be able to use a Numba CUDA mapped_array for this task. Under the hood, this tells Numba to create a pinned host allocation and make it usable on the device. It also tells Numba not to copy the array to the device, even though a pinned_array normally looks like a host array to Numba.

    Coupled with that, we need to make sure that Numba is not copying any other arrays for us. Passing a host array to a kernel triggers an "automatic" copy, and that copy forces synchronization, which we don't want here.

    I don't really know how to measure the progress of that algorithm. For example, the while loop in poission_sample seems to iterate 4 times for the item whose thread_id is zero, but I doubt that holds across the whole out array. (I do have a better idea of how to monitor the progress of other algorithms.)
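To illustrate why the per-thread work is uneven, here is a host-only sketch of the same loop in plain Python (no GPU involved). The helper name `poisson_sample_with_count` and the iteration counter are additions for illustration only; they are not part of the kernel.

```python
from math import exp, gamma


def poisson_sample_with_count(rate, random_number):
    """Same loop as the kernel's poission_sample, plus an iteration counter."""
    probability_sum = 0.0
    index = -1
    iterations = 0
    while probability_sum < random_number:
        index += 1
        probability_sum += ((rate ** index) / gamma(index + 1)) * exp(-rate)
        iterations += 1
    return index, iterations


if __name__ == "__main__":
    # Try a small, a middling and a large uniform draw with rate 5.
    for u in (0.001, 0.5, 0.99):
        sample, iterations = poisson_sample_with_count(5, u)
        print(u, sample, iterations)
```

With a rate of 5, a draw of 0.5 needs 6 passes through the loop, a draw close to 0 exits after one pass, and a draw close to 1 needs about twice as many as the middling one. Counting finished threads is therefore only a rough proxy for the work done.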

    If we know how the reported progress relates to the time the algorithm should take, then we can simply monitor the value reported by the kernel. When it reaches 100% (or nearly), we stop monitoring and proceed with the rest of the work.

    I'll arbitrarily decide for demonstration purposes that the progress of this algorithm is measured by the number of threads that have completed the work.
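Under that definition, the percentage is just the finished-thread counter divided by the total number of launched threads. The following is a small host-side sketch of that arithmetic; `grid_size` and `percent_complete` are hypothetical helper names, not Numba functions.

```python
from math import ceil


def grid_size(number_of_samples, threads_per_block):
    """Return (blocks, total_threads) for a 1D launch covering all samples."""
    blocks = ceil(number_of_samples / threads_per_block)
    return blocks, blocks * threads_per_block


def percent_complete(finished_threads, total_threads):
    """Return completion as a float percentage, clamped to at most 100."""
    if total_threads <= 0:
        raise ValueError("total_threads must be positive")
    return min(100.0, 100.0 * finished_threads / total_threads)


if __name__ == "__main__":
    blocks, total = grid_size(10000000, 512)
    print(blocks, total)
    print(percent_complete(total // 2, total))
```

For 10,000,000 samples and 512 threads per block this gives 19,532 blocks and 10,000,384 threads, so the counter runs slightly past the requested sample count. Dividing by the exact thread total, rather than by an integer-divided hundredth of it, keeps the reported value from drifting above the true percentage.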

    When we cannot determine progress from the value the kernel reports (e.g. your case, for me anyway), an alternative is to keep monitoring and reporting progress until an event signals that the kernel has completed.

    Anyhow, the following works for me on Linux as a rough sketch. It demonstrates the use of events, although if you know how the algorithm progresses, the events are not really needed. Here is the version with events:

    $ cat t1.py
    from __future__ import print_function, absolute_import
    
    from numba import cuda
    from numba.cuda.random import create_xoroshiro128p_states, xoroshiro128p_uniform_float32
    import numpy as np
    from math import gamma, exp, ceil
    
    
    # This function is just an example of what I'd like to put a progress bar on
    @cuda.jit
    def generate_samples(rng_states, out, rate, progress):
        thread_id = cuda.grid(1)
    
        def poission_sample(rate, random_number, progress):
            probability_sum = 0
            index = -1
            while probability_sum < random_number:
                index += 1
                probability_sum += ((rate**index)/gamma(index+1)) * exp(-rate)
            cuda.atomic.add(progress, 0, 1)
            return index
    
        out[thread_id] = poission_sample(rate, xoroshiro128p_uniform_float32(rng_states, thread_id), progress)
    
    number_of_samples = 10000000
    progress = cuda.mapped_array(1, dtype=np.int64)
    progress[0] = 0
    last_pct = 0
    my_e = cuda.event()
    threads_per_block = 512
    blocks = ceil(number_of_samples/threads_per_block)
    my_divisor = (threads_per_block * blocks) // 100
    rng_states = create_xoroshiro128p_states(threads_per_block * blocks, seed=1)
    out = np.zeros(threads_per_block * blocks, dtype=np.float32)
    out_d = cuda.device_array_like(out)
    generate_samples[blocks, threads_per_block](rng_states, out_d, 5, progress)
    my_e.record()
    print(last_pct)
    while not my_e.query():
        cur_pct = progress[0]/my_divisor
        if cur_pct > last_pct + 10:
            last_pct = cur_pct
            print(cur_pct)
    out = out_d.copy_to_host()
    
    print('Average Sample:', out.mean())
    $ python3 t1.py
    0
    10.00129996100117
    20.00291991240263
    30.004539863804087
    40.00519984400468
    50.00713978580642
    60.00811975640731
    70.00941971740848
    80.01039968800936
    90.01105966820995
    Average Sample: 5.000568
    $
    

    Here is a version without events:

    $ cat t2.py
    from __future__ import print_function, absolute_import
    
    from numba import cuda
    from numba.cuda.random import create_xoroshiro128p_states, xoroshiro128p_uniform_float32
    import numpy as np
    from math import gamma, exp, ceil
    
    
    # This function is just an example of what I'd like to put a progress bar on
    @cuda.jit
    def generate_samples(rng_states, out, rate, progress):
        thread_id = cuda.grid(1)
    
        def poission_sample(rate, random_number, progress):
            probability_sum = 0
            index = -1
            while probability_sum < random_number:
                index += 1
                probability_sum += ((rate**index)/gamma(index+1)) * exp(-rate)
            cuda.atomic.add(progress, 0, 1)
            return index
    
        out[thread_id] = poission_sample(rate, xoroshiro128p_uniform_float32(rng_states, thread_id), progress)
    
    number_of_samples = 10000000
    progress = cuda.mapped_array(1, dtype=np.int64)
    progress[0] = 0
    last_pct = 0
    threads_per_block = 512
    blocks = ceil(number_of_samples/threads_per_block)
    my_divisor = (threads_per_block * blocks) // 100
    rng_states = create_xoroshiro128p_states(threads_per_block * blocks, seed=1)
    out = np.zeros(threads_per_block * blocks, dtype=np.float32)
    out_d = cuda.device_array_like(out)
    generate_samples[blocks, threads_per_block](rng_states, out_d, 5, progress)
    print(last_pct)
    while last_pct < 90:
        cur_pct = progress[0]/my_divisor
        if cur_pct > last_pct + 10:
            last_pct = cur_pct
            print(cur_pct)
    out = out_d.copy_to_host()
    
    print('Average Sample:', out.mean())
    $ python3 t2.py
    0
    10.000019999400019
    20.000039998800037
    30.000059998200054
    40.000079997600075
    50.00009999700009
    60.00011999640011
    70.00013999580013
    80.00015999520015
    90.00017999460016
    Average Sample: 5.000568
    $
    

    I ran both of these on Linux. The version without events may work better on Windows, or possibly the other way around (the event query may push work submission along). If you are using a display GPU on Windows (i.e. a GPU not in TCC mode), then WDDM work batching/scheduling may present an issue. You could try both settings of Windows Hardware-Accelerated GPU Scheduling to see whether one works better than the other.

    Also, this kernel runs in less than a second on my GPU (the kernel duration is about 300 ms on my GTX 970), so this might not be an interesting test case.
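For something closer to a tqdm-style bar, the polling loops above could be wrapped in a small host-side helper. This is only a sketch under assumptions: `render_bar` and `watch_progress` are hypothetical names, and the demo drives them with a background Python thread standing in for the kernel. With the events version, `read_counter` would presumably be `lambda: progress[0]` and `is_done` would be `my_e.query`. The short sleep between polls is an addition to avoid spinning the host CPU; whether it interacts well with WDDM batching is untested.

```python
import sys
import threading
import time


def render_bar(fraction, width=40):
    """Return a text bar such as '[#####-----]  50.0%' for a 0..1 fraction."""
    fraction = max(0.0, min(1.0, fraction))
    filled = int(round(width * fraction))
    bar = "#" * filled + "-" * (width - filled)
    return "[" + bar + "] " + "{:5.1f}%".format(100 * fraction)


def watch_progress(read_counter, is_done, total, interval=0.1, out=sys.stdout):
    """Redraw the bar until is_done() returns True; return the last count."""
    while True:
        # Check completion first so the final counter read is not stale.
        done = is_done()
        count = read_counter()
        out.write("\r" + render_bar(count / total))
        out.flush()
        if done:
            break
        time.sleep(interval)
    out.write("\n")
    return count


if __name__ == "__main__":
    # Stand-in for the kernel: a thread that bumps a counter 50 times.
    state = {"count": 0}

    def fake_kernel():
        for _ in range(50):
            time.sleep(0.01)
            state["count"] += 2

    worker = threading.Thread(target=fake_kernel)
    worker.start()
    watch_progress(lambda: state["count"], lambda: not worker.is_alive(),
                   total=100, interval=0.05)
    worker.join()
```

Passing the counter and the completion check as callables keeps the helper independent of Numba, so the same loop works for either the event-based version or a version that stops at a threshold.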