Tags: python, performance, list, reduce, partial-application

Ordered reduce for multiple functions in Python


Ordered list reduction

I need to reduce some lists where, depending on element types, the speed and implementation of the binary operation vary, i.e. large reductions in runtime can be gained by reducing certain pairs with specific functions first. For example, foo(a[0], bar(a[1], a[2])) might be a lot slower than bar(foo(a[0], a[1]), a[2]), even though both give the same result in this case.
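
For instance (a contrived illustration with NumPy, not my actual functions), when two of the elements are scalars and one is a large array, combining the scalar pair first avoids a second pass over the array:

import numpy as np

a = [2.0, 3.0, np.random.rand(10_000_000)]
slow = a[0] * (a[1] * a[2])   # scales the big array twice
fast = (a[0] * a[1]) * a[2]   # one scalar multiply, then a single pass over the array
# both results agree (up to floating-point rounding)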

I already have the code that produces an optimal ordering in the form of a list of (pair_index, binary_function) tuples. What I am struggling with is implementing an efficient function to perform the reduction, ideally one that returns a new partial function which can then be used repeatedly on lists with the same type-ordering but varying values.

Simple and slow(?) solution

Here is my naive solution: a for loop, deletion of elements, and a closure over the (pair_index, binary_function) lists to return a 'precomputed' function.

def ordered_reduce(a, pair_indexes, binary_functions, precompute=False):
    """
    a: list to reduce, length n
    pair_indexes: order of pairs to reduce, length (n-1)
    binary_functions: functions to use for each reduction, length (n-1)
    """
    def ord_red_func(x):
        y = list(x)  # copy so as not to mutate the caller's list
        for p, f in zip(pair_indexes, binary_functions):
            b = f(y[p], y[p+1])
            # Replace pair
            del y[p]
            y[p] = b
        return y[0]

    return ord_red_func if precompute else ord_red_func(a)

>>> foos = (lambda a, b: a - b, lambda a, b: a + b, lambda a, b: a * b)
>>> ordered_reduce([1, 2, 3, 4], (2, 1, 0), foos)
1
>>> 1 * (2 + (3-4))
1

And here is how precomputation works:

>>> foo = ordered_reduce(None, (0, 1, 0), foos, precompute=True)
>>> foo([1, 2, 3, 4])
-7
>>> (1 - 2) * (3 + 4)
-7

However it involves copying the whole list and is also (therefore?) slow. Is there a better/standard way to do this?

(EDIT:) Some Timings:

from operator import add
from functools import reduce
from itertools import repeat
from random import random

r = 100000
xs = [random() for _ in range(r)]
# slightly trivial choices of pairs and functions, to replicate reduce
ps = [0]*(r-1)
fs = repeat(add)
foo = ordered_reduce(None, ps, fs, precompute=True)

>>> %timeit reduce(add, xs)
100 loops, best of 3: 3.59 ms per loop
>>> %timeit foo(xs)
1 loop, best of 3: 1.44 s per loop

This is kind of a worst-case scenario and slightly cheating, as reduce does not take an iterable of functions, but a function which does (though without any reordering) is still pretty fast:

def multi_reduce(fs, xs):
    xs = iter(xs)
    x = next(xs)
    for f, nx in zip(fs, xs):
        x = f(x, nx)
    return x

>>> %timeit multi_reduce(fs, xs)
100 loops, best of 3: 8.71 ms per loop

(EDIT2:) And for fun, the performance of a massively cheating 'compiled' version, which gives some idea of the total overhead involved.

from numba import jit

@jit(nopython=True)
def numba_sum(xs):
    y = 0
    for x in xs:
        y += x
    return y

>>> %timeit numba_sum(xs)
1000 loops, best of 3: 1.46 ms per loop

Solution

  • When I read this problem, I immediately thought of reverse Polish notation (RPN). While it may not be the best approach, it still gives a substantial speedup in this case.

    My second thought is that you may get an equivalent result if you just reorder the sequence xs appropriately to get rid of del y[p]. (Arguably the best performance would be achieved if the whole reduce procedure is written in C. But it's a different kettle of fish.)
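
    One way to act on that second idea without physically reordering xs (the names precompute_slots and ordered_reduce_slots below are purely illustrative) is to translate pair_indexes once into pairs of fixed slots of the original list, so that the per-call loop works on a scratch copy and never deletes anything:

    def precompute_slots(n, pair_indexes):
        # Simulate the shrinking list once, recording which *original* slots
        # each step combines; the O(n) deletions happen only here.
        slots = list(range(n))
        steps = []
        for p in pair_indexes:
            steps.append((slots[p], slots[p + 1]))
            del slots[p + 1]
        return steps

    def ordered_reduce_slots(steps, binary_functions, xs):
        # Hot loop: no deletions, just in-place writes into a scratch copy.
        y = list(xs)
        for (i, j), f in zip(steps, binary_functions):
            y[i] = f(y[i], y[j])   # slot j is never read again
        return y[0]

    The O(n) deletions now happen only once per type-ordering, at precompute time, rather than on every call.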

    Reverse Polish Notation

    If you are not familiar with RPN, please read the short explanation in the Wikipedia article. Basically, all operations can be written down without parentheses; for example (1-2)*(3+4) is 1 2 - 3 4 + * in RPN, while 1-(2*(3+4)) becomes 1 2 3 4 + * -.
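
    For instance, evaluating 1 2 - 3 4 + * left to right, with the stack shown after each token:

    1   ->  [1]
    2   ->  [1, 2]
    -   ->  [-1]         # 1 - 2
    3   ->  [-1, 3]
    4   ->  [-1, 3, 4]
    +   ->  [-1, 7]      # 3 + 4
    *   ->  [-7]         # (1 - 2) * (3 + 4)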

    Here is a simple implementation of an RPN parser. I separated the list of objects from the RPN sequence, so that the same sequence can be used directly for different lists.

    def rpn(arr, seq):
        '''
        Reverse Polish Notation algorithm
        (this version works only for binary operators)
        arr: array of objects
        seq: rpn sequence containing indices of objects from arr and functions
        '''
        stack = []
        for x in seq:
            if isinstance(x, int):
                # it's an object: push it onto the stack
                stack.append(arr[x])
            else:
                # it's a function: pop two objects, apply the function,
                # push the result back onto the stack
                b = stack.pop()
                # a = stack.pop()
                # stack.append(x(a, b))
                # shortcut:
                stack[-1] = x(stack[-1], b)
        return stack.pop()
    

    Example of usage:

    # Say we have an array
    arr = [100, 210, 42, 13]
    # and want to calculate
    #     (100 - 210) * (42 + 13)
    # It translates to RPN:
    #     100 210 - 42 13 + *
    # or
    #     arr[0] arr[1] - arr[2] arr[3] + *
    # So we apply (with subtract, add, multiply being e.g. the numpy ufuncs):
    rpn(arr, [0, 1, subtract, 2, 3, add, multiply])
    

    To apply RPN to your case, you'd need either to generate RPN sequences from scratch or to convert your (pair_indexes, binary_functions) into them. I haven't thought much about a converter, but it surely can be done; one possible sketch follows.
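
    For example, a converter along these lines (just a sketch; to_rpn is a made-up name) falls out of running the same shrinking-list simulation on RPN sub-sequences instead of on values:

    def to_rpn(n, pair_indexes, binary_functions):
        # seqs[k] is the RPN sub-sequence that produces the value currently
        # sitting at position k of the shrinking list.
        seqs = [[i] for i in range(n)]
        for p, f in zip(pair_indexes, binary_functions):
            seqs[p] = seqs[p] + seqs[p + 1] + [f]
            del seqs[p + 1]
        return seqs[0]

    With the foos from the question, to_rpn(4, (0, 1, 0), foos) yields [0, 1, sub, 2, 3, add, mul] (writing the three lambdas as sub, add, mul), i.e. the RPN form of (a[0] - a[1]) * (a[2] + a[3]), which can be passed straight to rpn above. The list concatenations make the conversion quadratic in the worst case, but it only has to run once per type-ordering.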

    Tests

    Your original test comes first:

    r = 100000
    xs = [random() for _ in range(r)]
    ps = [0]*(r-1)
    fs = repeat(add)
    foo = ordered_reduce(None, ps, fs, precompute=True)
    rpn_seq = [0] + [x for i, f in zip(range(1,r), repeat(add)) for x in (i,f)]
    rpn_seq2 = list(range(r)) + list(repeat(add,r-1))
    # Here rpn_seq denotes ((...(_ + _) + _) + _)
    # and rpn_seq2 denotes (_ + (_ + (_ + ...))).
    # Obviously, they are not equivalent but with 'add' they yield the same result. 
    
    %timeit reduce(add, xs)
    100 loops, best of 3: 7.37 ms per loop
    %timeit foo(xs)
    1 loops, best of 3: 1.71 s per loop
    %timeit rpn(xs, rpn_seq)
    10 loops, best of 3: 79.5 ms per loop
    %timeit rpn(xs, rpn_seq2)
    10 loops, best of 3: 73 ms per loop
    
    # Pure numpy just out of curiosity:
    %timeit np.sum(np.asarray(xs))
    100 loops, best of 3: 3.84 ms per loop
    xs_np = np.asarray(xs)
    %timeit np.sum(xs_np)
    The slowest run took 4.52 times longer than the fastest. This could mean that an intermediate result is being cached 
    10000 loops, best of 3: 48.5 µs per loop
    

    So, rpn was 10 times slower than reduce but about 20 times faster than ordered_reduce.

    Now, let's try something more complicated: alternately adding and multiplying matrices. I need a special function for it to test against reduce.

    add_or_dot_b = 1
    def add_or_dot(x,y):
        '''calls 'add' and 'np.dot' alternately'''
        global add_or_dot_b
        if add_or_dot_b:
            out = x+y
        else:
            out = np.dot(x,y)
        add_or_dot_b = 1 - add_or_dot_b
        # normalizing out to avoid `inf` in results
        return out/np.max(out)
    
    r = 100001      # +1 for convenience
                    # (we apply an even number of functions) 
    xs = [np.random.rand(2,2) for _ in range(r)]
    ps = [0]*(r-1)
    fs = repeat(add_or_dot)
    foo = ordered_reduce(None, ps, fs, precompute=True)
    rpn_seq = [0] + [x for i, f in zip(range(1,r), repeat(add_or_dot)) for x in (i,f)]
    
    %timeit reduce(add_or_dot, xs)
    1 loops, best of 3: 894 ms per loop
    %timeit foo(xs)
    1 loops, best of 3: 2.72 s per loop
    %timeit rpn(xs, rpn_seq)
    1 loops, best of 3: 1.17 s per loop
    

    Here, rpn was roughly 30% slower than reduce and more than 2 times faster than ordered_reduce.