Search code examples
pythonnumpyoptimization

Accumulate an array of coefficients based on an array of non unique items in Numpy


I have a numpy.array items of N~10^7 integers and a numpy.array coefficients of N floats. items contains repeated entries, and only ~sqrt(N) unique entries.

I would like an array of unique entries unique_items and an array of accumulated coefficients unique_coefficients such that unique_coefficients[i] is the sum over j of coefficients[j] such that items[j]=unique_items[i].

For example if we have

items = [1,1,2,2,3,4]
coefficients = [1,2,3,4,5,6]

then

unique_items = [1,2,3,4]
unique_coefficients = [3,7,5,6]

The first entry of unique_coefficients is the sum of the 2 first entries of coefficients because the two first entries of items are 1 etc. Note that items is not necessarily sorted.

For now, I first compute a dict where each key is a unique element of items, and each value is a list of indices of the element in items, then I sum over the indices stored in this dict:

d = {}
for index, item in enumerate(items):
    if not item in d:
        d[item] = []
    d[item].append(index) 
unique_items = d.keys()
unique_coefficients = [np.sum(coefficients[i]) for i in d.values()]

Is there a way to avoid doing this python for loop and stay in numpy instead ? This for loop is where my code spends the most time.


Solution

  • Assuming the input items is sorted, get the unique values, then aggregate with add.reduceat:

    items = np.array([1,1,2,2,3,4])
    coefficients = np.array([1,2,3,4,5,6])
    
    unique_items, idx = np.unique(items, return_index=True)
    # array([1, 2, 3, 4])
    unique_coefficients = np.add.reduceat(coefficients, idx)
    # array([3, 7, 5, 6])
    

    If the input is not necessarily sorted, first add a sorting step with argsort:

    items = np.array([1,2,2,3,4,1])
    coefficients = np.array([1,3,4,5,6,2])
    
    order = np.argsort(items)
    
    unique_items, idx = np.unique(items[order], return_index=True)
    # array([1, 2, 3, 4])
    unique_coefficients = np.add.reduceat(coefficients[order], idx)
    # array([3, 7, 5, 6])
    

    You might also consider 's groupby.sum:

    import pandas as pd
    
    out = pd.Series(coefficients, index=items).groupby(level=0).sum()
    
    1    3
    2    7
    3    5
    4    6
    dtype: int64
    

    Finally, a simple pure python approach could be:

    items = [1,1,2,2,3,4]
    coefficients = [1,2,3,4,5,6]
    
    out = {}
    for i, c in zip(items, coefficients):
        out[i] = out.get(i, 0)+c
    
    # {1: 3, 2: 7, 3: 5, 4: 6}