I have a custom machine learning objective function which is a kind of linear bounded function and mainly uses numpy.clip. During training, the objective function is called a huge number of times, so the training time largely depends on how fast numpy.clip can run.
So my question is: is there anything I can do to make numpy.clip run faster?
So far I have tried Numba, but I get basically no improvement at all (example below).
import timeit
from numba import jit
import numpy as np

def clip(x, l, u):
    return x.clip(l, u)

@jit(nopython=True, parallel=True, fastmath=True)
def clip2(x, l, u):
    return x.clip(l, u)

x = np.random.rand(1000)
l = -np.random.rand(1000)
u = np.random.rand(1000)

timeit.timeit(lambda: clip(x, l, u))
>>> 7.600710524711758

timeit.timeit(lambda: clip2(x, l, u))
>>> 23.19934402871877
Is there anything wrong with the way I use Numba, or can it really not help in this case?
Is there any other approach worth trying?
One note: in my use case, the vectors x, l and u passed to clip (defined above) have a length of around 1000, so I really want to optimize for that particular case.
Thanks so much for your help.
As pointed out in the comments, Numba introduces some compilation overhead the first time the function is called (for a particular datatype signature). Whether that should be included in the benchmark is difficult to answer based on the limited information you've shared.
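If you want to exclude that one-time compilation cost, the simplest approach is to call the jitted function once with representative inputs before timing it. A minimal sketch, reusing clip2, x, l and u from your example:

# Call once so Numba compiles clip2 for this signature before timing.
clip2(x, l, u)

# Subsequent calls reuse the cached machine code, so the timing now
# reflects only execution cost, not compilation.
print(timeit.timeit(lambda: clip2(x, l, u), number=100_000))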
The Numpy functions supported by Numba are convenient and robust, but you can often gain a little extra performance by implementing a specific function for your application.
The parallel=True option doesn't do anything here, as the warning Numba emits points out. Using np.clip you could perhaps gain a little with the out= keyword, if you're willing to modify the input in place. Overall I get the best performance using numba.vectorize, as is often the case in my experience.
from numba import njit, vectorize
import numpy as np

def clip1(x, l, u):
    return x.clip(l, u)

@njit(fastmath=True)
def clip2(x, l, u):
    return x.clip(l, u)

@njit(fastmath=True)
def clip3(x, l, u):
    return np.clip(x, l, u, out=x)

@vectorize
def clip4(x, l, u):
    return max(min(x, u), l)
On my machine, with a warm-up (excluding compilation), this results in:
clip1: 7.19 µs ± 546 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
clip2: 2.88 µs ± 35.9 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
clip3: 2.54 µs ± 177 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
clip4: 1.2 µs ± 39.3 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
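If the first-call compilation cost matters inside your training loop, you can also make Numba compile eagerly by passing vectorize an explicit signature. A small sketch of that idea; the name clip4_eager is mine, and the float64 signature is an assumption matching the arrays in your question:

from numba import vectorize, float64

# Giving an explicit signature makes Numba compile at decoration time
# rather than on the first call, so no compilation cost lands inside
# the training loop.
@vectorize([float64(float64, float64, float64)])
def clip4_eager(x, l, u):
    return max(min(x, u), l)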