Resampling an array fast with numba

I have a function that takes in an array an returns an upsampled version of it by using linear interpolation between the existing data points. The input array has approximately 5000 elements. I would also like to implement a version that handles 2d arrays with shape ~(4000, 5000).

Why is this function not faster with numba njit? It runs superfast on 1d arrays without numba. With numba.njit it takes much longer. Any advice is much appreciated!

import numpy as np
import numba as nb

@nb.njit
def sync_sampl(time_series, repeat=10):
    if time_series.ndim == 1:
        time_series = time_series.repeat(repeat)
        # from https://mail.python.org/pipermail/scipy-user/2010-November/027574.html
        # by Dharas Pothina
        x = time_series
        xtmp = x.copy()
        xtmp[1:] = x[1:] - x[:-1]
        x_idx = np.nonzero(xtmp)[0]
        x_vals = x[x_idx]
        x_new = np.interp(np.arange(len(x)), x_idx, x_vals)
        x_new = x_new[:-(repeat - 1) or None]  # removing last elements to not get straight line at the end.
        return x_new
    else:
        # Implement 2d version here?
        pass

Solution

Here are some tips to make the Numba code faster:

You can tell to Numba that the input array is contiguous (only if it is true) and pre-compile the function ahead of time not to pay expensive compilation times at runtime. Here is an example: @nb.njit('float[::1](float32[::1],int32)').
you can use the option parallel=True on @nb.njit to execute some Numpy function in parallel. However, most functions a not yet running in parallel with it. Still, you can run loops in parallel with that and nb.prange.
As said in the comments, Numba is often good with loops. Numba loops are not always faster because the compiled Numpy code tends to be better vectorized than the JIT code from Numba. However, loops enable you to avoid creating/filling/reading many temporary arrays making your code memory-bound. Well-optimized loops also often help to reduce the number of instructions required to perform a custom operations. In your case, the lines from xtmp = x.copy() to x_vals = x[x_idx] can be rewritten using one fast memory-efficient parallel loop (and so no temporary buffers).
Note that you can also use the option fastmath=True if you are sure that there is no NaN/+Inf/-Inf/-0 values in your code (and the need for exact IEEE-754 rules like rounding) to improve performance even further. Using 32-bit floats may help too despite the significant loss of precision.