I have a (large) length-N array of k distinct functions, and a length-N array of abscissas. I want to evaluate the functions at the abscissas to return a length-N array of ordinates, and critically, I need to do it very fast.
I have tried the following loop over a call to np.where, which is too slow:
Create some fake data to illustrate the problem:
import numpy as np

def trivial_functional(i):
    return lambda x: i * x

k = 250
func_table = [trivial_functional(j) for j in range(k)]
func_table = np.array(func_table)  # possibly unnecessary
We have a table of 250 distinct functions. Now I create a large array with many repeated entries of those functions, and a set of points of the same length at which these functions should be evaluated.
Npts = int(1e6)  # must be an int; 1e6 is a float and breaks array-shape arguments
abcissa_array = np.random.random(Npts)
# np.random.random_integers is deprecated; randint's upper bound is exclusive
function_indices = np.random.randint(0, len(func_table), Npts)
func_array = func_table[function_indices]
Finally, loop over every function used by the data and evaluate it on the set of relevant points:
desired_output = np.zeros(Npts)
for func_index in set(function_indices):
    idx = np.where(function_indices == func_index)[0]
    desired_output[idx] = func_table[func_index](abcissa_array[idx])
This loop takes ~0.35 seconds on my laptop and is the biggest bottleneck in my code by an order of magnitude.
Does anyone see how to avoid the blind lookup call to np.where? Is there a clever use of numba that can speed this loop up?
This does almost the same thing as your (excellent!) self-answer, but with a bit less rigamarole. It seems marginally faster on my machine as well -- about 30ms based on a cursory test.
def apply_indexed_fast(array, func_indices, func_table):
    # sort indices so that entries using the same function are contiguous
    func_argsort = func_indices.argsort()
    # find the boundary of each function's chunk in the sorted index array
    func_ranges = list(np.searchsorted(func_indices[func_argsort], range(len(func_table))))
    func_ranges.append(None)
    out = np.zeros_like(array)
    for f, start, end in zip(func_table, func_ranges, func_ranges[1:]):
        ix = func_argsort[start:end]
        out[ix] = f(array[ix])
    return out
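As a quick sanity check, you can run it against the setup from the question and compare with the np.where loop's output (this assumes abcissa_array, function_indices, func_table, and desired_output from above are still in scope):

fast_output = apply_indexed_fast(abcissa_array, function_indices, func_table)
print(np.allclose(fast_output, desired_output))  # should print True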
Like yours, this splits a sequence of argsort indices into chunks, each of which corresponds to a function in func_table. It then uses each chunk to select input and output indices for its corresponding function. To determine the chunk boundaries, it uses np.searchsorted instead of np.unique -- where searchsorted(a, b) could be thought of as a binary search algorithm that returns the index of the first value in a equal to or greater than the given value or values in b.
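For a concrete illustration (with made-up values), here is searchsorted computing the chunk boundaries for a small sorted array of function indices:

sorted_indices = np.array([0, 0, 1, 1, 1, 2])
print(np.searchsorted(sorted_indices, range(3)))  # prints [0 2 5]
# i.e. function 0 occupies slots 0:2, function 1 slots 2:5, function 2 slots 5 onward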
Then the zip function simply iterates over its arguments in parallel, returning a single item from each one, collected together in a tuple, and stringing those together into a list. (So zip([1, 2, 3], ['a', 'b', 'c'], ['b', 'c', 'd']) returns [(1, 'a', 'b'), (2, 'b', 'c'), (3, 'c', 'd')].) This, along with the for statement's built-in ability to "unpack" those tuples, allows for a terse but expressive way to iterate over multiple sequences in parallel.
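To make the unpacking concrete, a tiny sketch using the same toy inputs:

for number, letter_a, letter_b in zip([1, 2, 3], ['a', 'b', 'c'], ['b', 'c', 'd']):
    print('%s %s %s' % (number, letter_a, letter_b))  # 1 a b / 2 b c / 3 c d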
In this case, I've used it to iterate over the functions in func_table alongside two out-of-sync copies of func_ranges. This ensures that the item from func_ranges in the end variable is always one step ahead of the item in the start variable. By appending None to func_ranges, I ensure that the final chunk is handled gracefully -- zip stops when any one of its arguments runs out of items, which cuts off the final value in the sequence. Conveniently, the None value also serves as an open-ended slice index!
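So, with hypothetical boundaries [0, 2, 5] and the appended None, the start/end pairing works out like this:

func_ranges = [0, 2, 5, None]
print(list(zip(func_ranges, func_ranges[1:])))  # [(0, 2), (2, 5), (5, None)]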
Another trick that does the same thing requires a few more lines, but has lower memory overhead, especially when used with the itertools equivalent of zip, izip:
import itertools

range_iter_a = iter(func_ranges)  # create iterators that step over the
range_iter_b = iter(func_ranges)  # values in `func_ranges` without making copies
next(range_iter_b, None)          # advance the second iterator by one
for f, start, end in itertools.izip(func_table, range_iter_a, range_iter_b):
    ...
However, these low-overhead generator-based approaches can sometimes be a bit slower than vanilla lists. Also, note that in Python 3, zip behaves more like izip.
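For reference, a minimal sketch of the same trick in Python 3, where izip is gone and the built-in zip is already lazy:

range_iter_a = iter(func_ranges)
range_iter_b = iter(func_ranges)
next(range_iter_b, None)  # advance the second iterator by one
for f, start, end in zip(func_table, range_iter_a, range_iter_b):
    ...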