This is essentially an expansion of this question. The answer provided there says to use the math library for trig functions. Unfortunately, the math library only works on scalars, not arrays. When I try using njit with np.tan() and np.arctan(), it uses my CPU, not my GPU:
import numpy as np
import numba as nb
@nb.njit(fastmath=True, parallel=True)
def f1(a):
    np.tan(a)
    return np.arctan(a)
I have large arrays (3mil cols) that I need to do both tan() and arctan() on. I have 80 million rows of data to process, but I can batch a few at a time to fit in memory, and I can divide these 80mil up into multiple CPU/GPU jobs to help parallelize it. I'd like to use my GPU for all these trig functions if it's faster. Will Numba work for this? Will something like PyTorch work here, or does that have too much overhead? What's the fastest way to do trig functions on large arrays?
EDIT: These 3mil column arrays that make up each row are generated dynamically at runtime. I can batch up to ~20 of these at a time without running out of memory, before sending the whole (20,3mil) array for the 2 trig functions and saving the results (thus freeing up memory for the next batch of 20). So I'm just looking for the fastest way to do tan() and arctan() on a (20,3mil) array, and I'll loop through my data 20 rows at a time. CPU or GPU for this? Numba or PyTorch? @njit or @vectorize?
I am going to show you several implementations. There is code at the end of this post for benchmarking, so please try for yourself and find out which one works best for you.
This is the basic implementation that uses numba jit.
import math
import numba as nb
import numpy as np
@nb.njit("float64[:](float64[:])", parallel=True)
def numba_jit_cpu(a):
    out = np.empty_like(a)
    for i in nb.prange(a.shape[0]):
        out[i] = math.atan(a[i])
    return out
Regarding your first question, the key is to use the math module, not numpy.
Here is the benchmark result for 20*3 million elements.
numba_jit_cpu: 121 ms
There are 4 million batches, so it will take about 5.6 days in total.
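As a rough sanity check on that estimate (assuming the 80 million rows from the question, processed 20 at a time):

rows_total = 80_000_000                 # rows to process (from the question)
rows_per_batch = 20                     # what fits in memory at once
batches = rows_total / rows_per_batch   # 4,000,000 batches
total_seconds = batches * 0.121         # 121 ms per batch
print(total_seconds / 86400)            # ~5.6 days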
However, in fact, more than half of this time is spent allocating out, which is where the output is stored. So we could make the function take out as an argument and write to it, like many numpy APIs do.
@nb.njit("void(float64[:],float64[:])", parallel=True)
def numba_jit_cpu_buf(a, out):
    for i in nb.prange(a.shape[0]):
        out[i] = math.atan(a[i])
numba_jit_cpu_buf: 51 ms
Now it will take about 2.4 days in total. This is the fastest way I can currently offer.
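In practice, the outer loop would look something like this; generate_batch and save_results are hypothetical placeholders for however you produce and store your data, and the point is that the output buffer is allocated once and reused for every batch:

buf = np.empty(20 * 3_000_000, dtype=np.float64)   # allocated once, reused for every batch
for batch_index in range(4_000_000):
    a = generate_batch(batch_index)                # hypothetical: yields a (20, 3mil) float64 array
    numba_jit_cpu_buf(a.ravel(), buf)              # writes the results into buf in place
    save_results(batch_index, buf)                 # hypothetical: persist results before reusing buf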
For comparison, let's measure how long it takes just to allocate, and just to write to, memory of the same size.
def single_thread_allocation(a):
    return np.ones_like(a)

def single_thread_write(a, x):
    x[:] = a
single_thread_allocation: 87 ms
single_thread_write: 27 ms
As you can see, numba_jit_cpu_buf is faster than allocating and only about 2 times slower than writing.
If you want to go faster than this, you have to account for every single memory allocation and write; that is the level of optimization required.
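Since the question actually needs both tan() and arctan(), one natural step in that direction is to compute both in a single pass, so each batch is only read from memory once. This is a sketch in the same style as numba_jit_cpu_buf, not something I benchmarked:

@nb.njit("void(float64[:],float64[:],float64[:])", parallel=True)
def numba_jit_cpu_buf2(a, out_tan, out_atan):
    # one read of a[i] feeds both outputs
    for i in nb.prange(a.shape[0]):
        out_tan[i] = math.tan(a[i])
        out_atan[i] = math.atan(a[i])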
But since I tried them anyway, here are some other ways to do it. The first one is to use numba's vectorize.
@nb.vectorize("float64(float64)", target="parallel", nopython=True)
def numba_vectorize_cpu(a):
return math.atan(a)
@nb.vectorize("float64(float64)", target="cuda")
def numba_vectorize_cuda(a):
return math.atan(a)
Note that to use the cuda target, you need to install cudatoolkit. These are very easy to implement, yet they are among the most efficient numba-based solutions.
numba_vectorize_cpu: 101 ms
numba_vectorize_cuda: 213 ms
The reason it is actually slower on the GPU is simply that data transfer is the bottleneck. Below you can see how long the transfer alone takes with numba.
from numba import cuda

def numba_copy_only(a):
    ac = cuda.to_device(a)
    return ac.copy_to_host()
numba_copy_only: 188 ms
Remember, this is more than three times longer than numba_jit_cpu_buf just to transfer the data.
If your data were float32, there might still be a chance, but with float64 there is none.
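For reference, the float32 variant of the CUDA kernel would look like this (a sketch only; it halves the transfer volume, but I have not benchmarked whether that is enough to beat the CPU version):

@nb.vectorize("float32(float32)", target="cuda")
def numba_vectorize_cuda_f32(a):
    return math.atan(a)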
Let's try the pytorch solutions as well.
import torch

class Model(torch.nn.Module):
    def forward(self, x):
        return torch.atan(x)

def pytorch_cpu(model, a):
    x = torch.from_numpy(a)
    return model(x).numpy()

def pytorch_cuda(model, a):
    x = torch.from_numpy(a).to("cuda")
    return model(x).cpu().numpy()

def pytorch_copy_only(a):
    x = torch.from_numpy(a).to("cuda")
    return x.cpu().numpy()
pytorch_cpu: 112 ms
pytorch_cuda: 186 ms
pytorch_copy_only: 180 ms
Not much difference.
It should be noted that it is theoretically possible to make this faster by taking advantage of DataLoader features such as prefetching and pinned memory.
However, it is not easy and you may find yourself going down the rabbit hole.
Keep in mind that with just one extra memory allocation or two extra writes, it will be slower than numba_jit_cpu_buf.
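If you do want to experiment in that direction, the pinned-memory part would look roughly like this (a minimal sketch; the real gains would have to come from overlapping these copies with computation, which this alone does not do):

def pytorch_cuda_pinned(model, a):
    x = torch.from_numpy(a).pin_memory()   # page-locked host memory for faster transfers
    x = x.to("cuda", non_blocking=True)    # asynchronous host-to-device copy
    return model(x).cpu().numpy()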
Here is the rest of the benchmark code.
import timeit

def baseline(a):
    return np.arctan(a)

def benchmark():
    rng = np.random.default_rng(0)
    batch = rng.random((20, 3 * 1000 * 1000), dtype=np.float64) * np.pi - np.pi / 2
    expected = baseline(batch)
    n = 100

    elapsed = timeit.timeit(lambda: baseline(batch), number=n) / n
    print(f"baseline: {elapsed * 1000:.0f} ms")

    elapsed = timeit.timeit(lambda: single_thread_allocation(batch), number=n) / n
    print(f"single_thread_allocation: {elapsed * 1000:.0f} ms")

    buf = np.zeros_like(batch)
    elapsed = timeit.timeit(lambda: single_thread_write(batch, buf), number=n) / n
    print(f"single_thread_write: {elapsed * 1000:.0f} ms")

    assert np.allclose(expected.ravel(), numba_jit_cpu(batch.ravel()))
    elapsed = timeit.timeit(lambda: numba_jit_cpu(batch.ravel()), number=n) / n
    print(f"numba_jit_cpu: {elapsed * 1000:.0f} ms")

    buf = np.zeros_like(batch)
    numba_jit_cpu_buf(batch.ravel(), buf.ravel())
    assert np.allclose(expected, buf)
    elapsed = timeit.timeit(lambda: numba_jit_cpu_buf(batch.ravel(), buf.ravel()), number=n) / n
    print(f"numba_jit_cpu_buf: {elapsed * 1000:.0f} ms")

    assert np.allclose(expected, numba_vectorize_cpu(batch))
    elapsed = timeit.timeit(lambda: numba_vectorize_cpu(batch), number=n) / n
    print(f"numba_vectorize_cpu: {elapsed * 1000:.0f} ms")

    assert np.allclose(expected, numba_vectorize_cuda(batch))
    elapsed = timeit.timeit(lambda: numba_vectorize_cuda(batch), number=n) / n
    print(f"numba_vectorize_cuda: {elapsed * 1000:.0f} ms")

    elapsed = timeit.timeit(lambda: numba_copy_only(batch), number=n) / n
    print(f"numba_copy_only: {elapsed * 1000:.0f} ms")

    model = Model().eval()
    assert np.allclose(expected, pytorch_cpu(model, batch))
    elapsed = timeit.timeit(lambda: pytorch_cpu(model, batch), number=n) / n
    print(f"pytorch_cpu: {elapsed * 1000:.0f} ms")

    model = model.to("cuda")
    assert np.allclose(expected, pytorch_cuda(model, batch))
    elapsed = timeit.timeit(lambda: pytorch_cuda(model, batch), number=n) / n
    print(f"pytorch_cuda: {elapsed * 1000:.0f} ms")

    elapsed = timeit.timeit(lambda: pytorch_copy_only(batch), number=n) / n
    print(f"pytorch_copy_only: {elapsed * 1000:.0f} ms")

if __name__ == "__main__":
    benchmark()