Big difference in execution time between the first and subsequent runs of CuPy functions

When I run CuPy functions on CuPy arrays, the first call to a function takes significantly longer than the second, even if the second call uses a different array.

Why is this?

import cupy as cp
cp.__version__
# 7.5.0

A = cp.random.random((1024, 1024))
B = cp.random.random((1024, 1024))

from time import time
def test(func, *args):
    t = time()
    func(*args)
    print("{}".format(round(time() - t, 4)))
    
test(cp.fft.fft2, A)
test(cp.fft.fft2, B)
# 0.129
# 0.001
test(cp.matmul, A, A.T)
test(cp.matmul, B, B.T)
# 0.171
# 0.0
test(cp.linalg.inv, A)
test(cp.linalg.inv, B)
# 0.259
# 0.002

Solution

CuPy just-in-time compiles a kernel the first time you use a function in a Python process, which takes some time. Subsequent calls reuse the compiled kernel and skip that cost.
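To measure this yourself, it helps to know that GPU work is queued asynchronously, so a timer that stops right after the call may not include the actual computation. Below is a minimal sketch of a timing helper that synchronizes the device before and after the call. The names `timed` and `sync` are my own, and the NumPy fallback is only there so the snippet still runs on a machine without a usable GPU (the warm-up effect will not show up in that case).

```python
import time

# Use CuPy when a GPU is usable; otherwise fall back to NumPy so the sketch
# still runs (assumption for illustration only, not part of the question).
try:
    import cupy as xp
    if xp.cuda.runtime.getDeviceCount() < 1:
        raise RuntimeError("no CUDA device available")

    def sync():
        # GPU work is queued asynchronously; block until it has finished.
        xp.cuda.Device().synchronize()
except Exception:
    import numpy as xp

    def sync():
        pass  # NumPy executes synchronously on the CPU


def timed(func, *args):
    """Return wall-clock seconds for one call, including queued GPU work."""
    sync()
    start = time.perf_counter()
    func(*args)
    sync()
    return time.perf_counter() - start


A = xp.random.random((1024, 1024))
B = xp.random.random((1024, 1024))

first = timed(xp.matmul, A, A.T)    # includes any one-time setup cost
second = timed(xp.matmul, B, B.T)   # reuses what the first call set up
print("first call:  {:.4f} s".format(first))
print("second call: {:.4f} s".format(second))
```

If you only care about steady-state performance, a common approach is to make one untimed warm-up call first and then time the following calls.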

From the CuPy documentation:

CuPy uses on-the-fly kernel synthesis: when a kernel call is required, it compiles a kernel code optimized for the shapes and dtypes of given arguments, sends it to the GPU device, and executes the kernel. The compiled code is cached to $(HOME)/.cupy/kernel_cache directory (this cache path can be overwritten by setting the CUPY_CACHE_DIR environment variable). It may make things slower at the first kernel call, though this slow down will be resolved at the second execution. CuPy also caches the kernel code sent to GPU device within the process, which reduces the kernel transfer time on further calls.
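If you want the on-disk kernel cache somewhere else, the quoted passage says it can be redirected with the CUPY_CACHE_DIR environment variable. A minimal sketch, assuming you set it from Python before CuPy is imported; the directory name below is hypothetical and chosen only for illustration:

```python
import os
import tempfile

# Hypothetical cache location, used here purely as an example.
cache_dir = os.path.join(tempfile.gettempdir(), "my_cupy_kernel_cache")
os.makedirs(cache_dir, exist_ok=True)

# Set the variable before CuPy compiles anything; the simplest way to be
# sure of that is to set it before importing CuPy at all.
os.environ["CUPY_CACHE_DIR"] = cache_dir

# import cupy as cp  # per the docs, compiled kernels would then be cached in cache_dir
print(os.environ["CUPY_CACHE_DIR"])
```

Setting the variable in the shell before starting Python should have the same effect.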