I have the following example array of x-y coordinate pairs:
A = np.array([[0.33703753, 3.],
[0.90115394, 5.],
[0.91172016, 5.],
[0.93230994, 3.],
[0.08084283, 3.],
[0.71531777, 2.],
[0.07880787, 3.],
[0.03501083, 4.],
[0.69253184, 4.],
[0.62214452, 3.],
[0.26953094, 1.],
[0.4617873 , 3.],
[0.6495549 , 0.],
[0.84531478, 4.],
[0.08493308, 5.]])
My goal is to reduce this to an array with six rows by taking the average of the x-values for each y-value, like so:
array([[0.6495549 , 0. ],
[0.26953094, 1. ],
[0.71531777, 2. ],
[0.41882167, 3. ],
[0.52428582, 4. ],
[0.63260239, 5. ]])
Currently I am achieving this by converting to a pandas dataframe, performing the calculation, and converting back to a numpy array:
>>> df = pd.DataFrame({'x':A[:, 0], 'y':A[:, 1]})
>>> df.groupby('y').mean().reset_index()
     y         x
0  0.0  0.649555
1  1.0  0.269531
2  2.0  0.715318
3  3.0  0.418822
4  4.0  0.524286
5  5.0  0.632602
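For reference, converting back to a numpy array with the x column first looks something like this (to_numpy is just one way to do the conversion):
>>> df.groupby('y', as_index=False).mean()[['x', 'y']].to_numpy()
array([[0.6495549 , 0.        ],
       [0.26953094, 1.        ],
       [0.71531777, 2.        ],
       [0.41882167, 3.        ],
       [0.52428582, 4.        ],
       [0.63260239, 5.        ]])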
Is there a way to perform this calculation using numpy, without having to resort to the pandas library?
Here's a completely vectorized solution that uses only numpy methods and no Python iteration:
# Sort rows by the y column, then get each unique y value, the index where
# its group starts in the sorted array, and the size of each group.
sort_indices = np.argsort(A[:, 1])
unique_y, unique_indices, group_count = np.unique(A[sort_indices, 1], return_index=True, return_counts=True)
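For the example A above, these come out as:
# unique_y       -> array([0., 1., 2., 3., 4., 5.])
# unique_indices -> array([ 0,  1,  2,  3,  9, 12])
# group_count    -> array([1, 1, 1, 6, 3, 3])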
Once we have the indices and counts of all the unique elements, we can use the np.ufunc.reduceat method to collect the results of np.add for each group, and then divide by the counts to get the means. Note that the y column is summed and averaged too, which simply gives back each group's y value:
group_sum = np.add.reduceat(A[sort_indices, :], unique_indices, axis=0)  # per-group sums of both columns
group_mean = group_sum / group_count[:, None]  # divide each group's sums by its size
# array([[0.6495549 , 0. ],
# [0.26953094, 1. ],
# [0.71531777, 2. ],
# [0.41882167, 3. ],
# [0.52428582, 4. ],
# [0.63260239, 5. ]])
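If you need this in more than one place, the steps above fit in a small helper function; the name groupby_mean here is just an illustrative choice, not an existing numpy function:
def groupby_mean(A):
    # Sort rows by the grouping column, find each group's first row and its
    # size, then divide the per-group sums by the group sizes.
    A_sorted = A[np.argsort(A[:, 1]), :]
    _, first_rows, counts = np.unique(A_sorted[:, 1], return_index=True, return_counts=True)
    return np.add.reduceat(A_sorted, first_rows, axis=0) / counts[:, None]

groupby_mean(A)  # same 6x2 result as above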
Comparing this solution with the other answers here (code at tio.run) for two scenarios (a minimal timing sketch follows the list):
- A contains 10k rows, with A[:, 1] containing N groups, where N varies from 1 to 10k
- A contains N rows (N varies from 1 to 10k), with A[:, 1] containing min(N, 1000) groups
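I can't reproduce the full benchmark here, but a minimal sketch of this kind of timing setup, comparing just the vectorized solution above with the pandas approach from the question (one data point from the first scenario), might look like this:
import timeit
import numpy as np
import pandas as pd

def numpy_mean_by_group(A):
    A_sorted = A[np.argsort(A[:, 1]), :]
    _, starts, counts = np.unique(A_sorted[:, 1], return_index=True, return_counts=True)
    return np.add.reduceat(A_sorted, starts, axis=0) / counts[:, None]

def pandas_mean_by_group(A):
    df = pd.DataFrame({'x': A[:, 0], 'y': A[:, 1]})
    return df.groupby('y').mean().reset_index()

# 10k rows, 100 groups
rng = np.random.default_rng(0)
A_big = np.column_stack([rng.random(10_000),
                         rng.integers(0, 100, 10_000).astype(float)])

print(timeit.timeit(lambda: numpy_mean_by_group(A_big), number=100))
print(timeit.timeit(lambda: pandas_mean_by_group(A_big), number=100))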
Observations:
- The numpy-only solutions (Dani's and mine) win easily: they are significantly faster than the pandas approach, possibly because the time taken to create the dataframe is overhead that the numpy solutions avoid.
- The pandas solution is slower than the python+numpy solutions (Jaimu's and the previous version of mine, shown below) for smaller arrays, since it's faster to just iterate in Python than to build a dataframe first, but those solutions become much slower than pandas as the array size or the number of groups increases.
Note: The previous version of this answer iterated over the groups as returned by the accepted answer to "Is there any numpy group by function?" and calculated the mean of each group individually:
First, we need to sort the array on the column we want to group by:
A_s = A[A[:, 1].argsort(), :]
Then, run the snippet from that answer. np.split splits its first argument at the indices given by the second argument, so splitting the sorted array at the first row of each new group gives one sub-array per group:
unique_elems, unique_indices = np.unique(A_s[:, 1], return_index=True)
# (array([0., 1., 2., 3., 4., 5.]), array([ 0, 1, 2, 3, 9, 12]))
split_indices = unique_indices[1:] # No need to split at the first index
groups = np.split(A_s, split_indices)
# [array([[0.6495549, 0. ]]),
# array([[0.26953094, 1. ]]),
# array([[0.71531777, 2. ]]),
# array([[0.33703753, 3. ],
# [0.93230994, 3. ],
# [0.08084283, 3. ],
# [0.07880787, 3. ],
# [0.62214452, 3. ],
# [0.4617873 , 3. ]]),
# array([[0.03501083, 4. ],
# [0.69253184, 4. ],
# [0.84531478, 4. ]]),
# array([[0.90115394, 5. ],
# [0.91172016, 5. ],
# [0.08493308, 5. ]])]
Now, groups is a list of np.arrays. Iterate over the list and take the mean of each array:
means = np.zeros((len(groups), groups[0].shape[1]))
for i, grp in enumerate(groups):
    means[i, :] = grp.mean(axis=0)
# array([[0.6495549 , 0. ],
# [0.26953094, 1. ],
# [0.71531777, 2. ],
# [0.41882167, 3. ],
# [0.52428582, 4. ],
# [0.63260239, 5. ]])
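The same result can also be produced with a comprehension instead of the explicit loop, since np.array stacks the per-group means into a 2-D array:
means = np.array([grp.mean(axis=0) for grp in groups])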