Accumulate an array of coefficients based on an array of non unique items in Numpy

I have a numpy.array items of N~10^7 integers and a numpy.array coefficients of N floats. items contains repeated entries, and only ~sqrt(N) unique entries.

I would like an array of unique entries unique_items and an array of accumulated coefficients unique_coefficients such that unique_coefficients[i] is the sum over j of coefficients[j] such that items[j]=unique_items[i].

For example if we have

items = [1,1,2,2,3,4]
coefficients = [1,2,3,4,5,6]

then

unique_items = [1,2,3,4]
unique_coefficients = [3,7,5,6]

The first entry of unique_coefficients is the sum of the 2 first entries of coefficients because the two first entries of items are 1 etc. Note that items is not necessarily sorted.

For now, I first compute a dict where each key is a unique element of items, and each value is a list of indices of the element in items, then I sum over the indices stored in this dict:

d = {}
for index, item in enumerate(items):
    if not item in d:
        d[item] = []
    d[item].append(index) 
unique_items = d.keys()
unique_coefficients = [np.sum(coefficients[i]) for i in d.values()]

Is there a way to avoid doing this python for loop and stay in numpy instead ? This for loop is where my code spends the most time.

Solution

Assuming the input items is sorted, get the unique values, then aggregate with add.reduceat:

items = np.array([1,1,2,2,3,4])
coefficients = np.array([1,2,3,4,5,6])

unique_items, idx = np.unique(items, return_index=True)
# array([1, 2, 3, 4])
unique_coefficients = np.add.reduceat(coefficients, idx)
# array([3, 7, 5, 6])

If the input is not necessarily sorted, first add a sorting step with argsort:

items = np.array([1,2,2,3,4,1])
coefficients = np.array([1,3,4,5,6,2])

order = np.argsort(items)

unique_items, idx = np.unique(items[order], return_index=True)
# array([1, 2, 3, 4])
unique_coefficients = np.add.reduceat(coefficients[order], idx)
# array([3, 7, 5, 6])

You might also consider pandas's groupby.sum:

import pandas as pd

out = pd.Series(coefficients, index=items).groupby(level=0).sum()

1    3
2    7
3    5
4    6
dtype: int64

Finally, a simple pure python approach could be:

items = [1,1,2,2,3,4]
coefficients = [1,2,3,4,5,6]

out = {}
for i, c in zip(items, coefficients):
    out[i] = out.get(i, 0)+c

# {1: 3, 2: 7, 3: 5, 4: 6}