pythoncudanumba# remove zero values of an array in numba cuda

I have a long array

```
arr = np.array([1,1,2,2,3,3,0,0,2,2])
```

I would like to remove all zero values of this array in numba cuda since the real array is very large and numpy is very slow.

Does anyone know how this can be done?

Solution

As already mentioned in the comments, the algorithm that you are looking for is called **Stream Compaction**. Since the said algorithm is not available out-of-the-box in `numba`

, so it has to be implemented by hand.
Basically, the process is as follows:

- Create binary mask of input values to identify non-zero values.
- Calculate prefix sum of the mask
- For each non-zero value, copy it to the index specified by the output of prefix sum at the corresponding index of the value.

For step 3 explanation, consider the following example:

Lets say there is a non-zero value at index number 5 of the input array. We pick the value at index number 5 of the prefix sum array (lets say that value was 3). Use this value as the output index. It means that element at index number 5 of the input will go to index number 3 in the final output.

Calculation of prefix sum is the difficult part in this whole process. I took the liberty of porting the C++ version of prefix sum algorithm (adapted from the GPU Gems 3 book) to `numba`

. If we implement prefix sum ourselves, we can merge step 1 and 2 into a single CUDA kernel.

Following is a complete working example of `numba`

based stream compaction with the provided use case.

```
import numpy as np
from numba import cuda, int32
BLOCK_SIZE = 8
#CUDA kernel to calculate prefix sum of each block of input array
@cuda.jit('void(int32[:], int32[:], int32[:], int32)')
def prefix_sum_nzmask_block(a, b, s, length):
ab = cuda.shared.array(shape=(BLOCK_SIZE), dtype=int32)
tid = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x;
if tid < length:
ab[cuda.threadIdx.x] = int32(a[tid] != 0); #Load mask of input data into shared memory
i = 1
while i<=cuda.threadIdx.x:
cuda.syncthreads() #Total number of cuda.synchthread calls = log2(BLOCK_SIZE).
ab[cuda.threadIdx.x] += ab[cuda.threadIdx.x - i] #Perform scan on shared memory
i *= 2
b[tid] = ab[cuda.threadIdx.x]; #Write scanned blocks to global memory
if(cuda.threadIdx.x == cuda.blockDim.x-1): #Last thread of block
s[cuda.blockIdx.x] = ab[cuda.threadIdx.x]; #Write last element of shared memory into global memory
#CUDA kernel to merge the prefix sums of individual blocks
@cuda.jit('void(int32[:], int32[:], int32)')
def prefix_sum_merge_blocks(b, s, length):
tid = (cuda.blockIdx.x + 1) * cuda.blockDim.x + cuda.threadIdx.x; #Skip first block
if tid<length:
i = 0
while i<=cuda.blockIdx.x:
b[tid] += s[i] #Accumulate last elements of all previous blocks
i += 1
#CUDA kernel to copy non-zero entries to the correct index of the output array
@cuda.jit('void(int32[:], int32[:], int32[:], int32)')
def map_non_zeros(a, prefix_sum, nz, length):
tid = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x;
if tid < length:
input_value = a[tid]
if input_value != 0:
index = prefix_sum[tid] #The correct output index is the value at current index of prefix sum array
nz[index-1] = input_value
#Apply stream compaction algorithm to get only the non-zero entries from the input array
def get_non_zeros(a):
length = a.shape[0]
block = BLOCK_SIZE
grid = int((length + block - 1)/block)
#Create auxiliary array to hold the sum of each block
block_sum = cuda.device_array(shape=(grid), dtype=np.int32)
#Copy input array from host to device
ad = cuda.to_device(a)
#Create prefix sum output array
bd = cuda.device_array_like(ad)
#Perform partial scan of each block. Store block sum in auxillary array named block_sum.
prefix_sum_nzmask_block[grid, block](ad, bd, block_sum, length)
#Add block sum to the output
prefix_sum_merge_blocks[grid, block](bd,block_sum,length);
#The last element of prefix sum contains the total number of non-zero elements
non_zero_count = int(bd[bd.shape[0]-1])
#Create device output array to hold ONLY the non-zero entries
non_zeros = cuda.device_array(shape=(non_zero_count), dtype=np.int32)
#Copy ONLY the non-zero entries
map_non_zeros[grid, block](a, bd, non_zeros, length)
#Return to host
return non_zeros.copy_to_host()
if __name__ == '__main__':
arr = np.array([1,1,2,2,3,3,0,0,2,2], dtype=np.int32)
nz = get_non_zeros(arr)
print(nz)
```

**Tested with the following environment:**

Python 3.6.7 in virtual environment

CUDA 10.0.130

NVIDIA Driver 410.48

numba 0.42

Ubuntu 18.04

**Disclaimer:** The code is for demo purpose only and has not been profiled/tested rigorously for performance measurements.

- Python Jinja2 LaTeX Table
- Getting attributes of a class
- How can I print many significant figures in Python?
- How to allow list append() method to return the new list
- Calculate Last Friday of Month in Pandas
- Python type hint for Iterable[str] that isn't str
- How to iterate over a list in chunks
- How to exit the entire application from a Python thread?
- Running shell command and capturing the output
- How do I pass a variable by reference?
- Convert range(r) to list of strings of length 2 in python
- How can I get the start and end dates for each week?
- how to use send_message() in python-telegram-bot
- Python conditional replacement based on element type
- How can I count the number of items in an arbitrary iterable (such as a generator)?
- Find longest consecutive range of numbers in list
- Insert text in braces with asyncpg
- How does one put a link / url to the web-site's home page in Django?
- How to determine if a path is a subdirectory of another?
- Custom Keybindings for Ipython terminal
- FastAPI asynchronous background tasks blocks other requests?
- How to make sure that information from one file is duplicated into several text documents, without specific lines
- Installing a Python environment with Anaconda
- sklearn pipeline model predicting same results for all input
- Brew command not found after installing Anaconda Python
- How to get an XPath from selenium webelement or from lxml?
- Pipe PuTTY console to Python script
- How to align the axes of a figure in matplotlib?
- Persist ParentDocumentRetriever of langchain
- How to reset index in a pandas dataframe?