I have a numpy.array items
of N~10^7 integers and a numpy.array coefficients
of N floats.
items
contains repeated entries, and only ~sqrt(N) unique entries.
I would like an array of unique entries unique_items
and an array of accumulated coefficients unique_coefficients
such that unique_coefficients[i]
is the sum over j
of coefficients[j]
such that items[j]=unique_items[i]
.
For example if we have
items = [1,1,2,2,3,4]
coefficients = [1,2,3,4,5,6]
then
unique_items = [1,2,3,4]
unique_coefficients = [3,7,5,6]
The first entry of unique_coefficients
is the sum of the 2 first entries of coefficients
because the two first entries of items
are 1 etc. Note that items
is not necessarily sorted.
For now, I first compute a dict where each key is a unique element of items
, and each value is a list of indices of the element in items
, then I sum over the indices stored in this dict:
d = {}
for index, item in enumerate(items):
if not item in d:
d[item] = []
d[item].append(index)
unique_items = d.keys()
unique_coefficients = [np.sum(coefficients[i]) for i in d.values()]
Is there a way to avoid doing this python for loop and stay in numpy instead ? This for loop is where my code spends the most time.
Assuming the input items
is sorted, get the unique
values, then aggregate with add.reduceat
:
items = np.array([1,1,2,2,3,4])
coefficients = np.array([1,2,3,4,5,6])
unique_items, idx = np.unique(items, return_index=True)
# array([1, 2, 3, 4])
unique_coefficients = np.add.reduceat(coefficients, idx)
# array([3, 7, 5, 6])
If the input is not necessarily sorted, first add a sorting step with argsort
:
items = np.array([1,2,2,3,4,1])
coefficients = np.array([1,3,4,5,6,2])
order = np.argsort(items)
unique_items, idx = np.unique(items[order], return_index=True)
# array([1, 2, 3, 4])
unique_coefficients = np.add.reduceat(coefficients[order], idx)
# array([3, 7, 5, 6])
You might also consider pandas's groupby.sum
:
import pandas as pd
out = pd.Series(coefficients, index=items).groupby(level=0).sum()
1 3
2 7
3 5
4 6
dtype: int64
Finally, a simple pure python approach could be:
items = [1,1,2,2,3,4]
coefficients = [1,2,3,4,5,6]
out = {}
for i, c in zip(items, coefficients):
out[i] = out.get(i, 0)+c
# {1: 3, 2: 7, 3: 5, 4: 6}