I have an array of shape (N, 3), like this one for instance:
arr = np.array(
[
[0,1,1],
[0,1,2],
[2,2,2]
]
)
I want to group this array by the first two columns to obtain an array like this:
grouped_arr = np.array([[[0,1,1], [0,1,2]], [[2,2,2]]])
Finally, I would like to keep only one element per group, where the third column is the mean of the group's third column:
final_array = np.array([[0,1,1.5], [2,2,2]])
I am trying something, but I am not sure it is correct or efficient:
import numpy as np
arr = np.array([[0, 1, 1], [0, 1, 2], [2, 2, 2]])
stacked = arr[:, :2]
unique_values = np.unique(stacked, axis=0)
groups = []
for v in unique_values:
    # Select the rows whose first two columns match this unique pair
    groups.append(arr[(stacked == v).all(axis=1)])
final_arr = []
for group in groups:
    mean = np.mean(group[:, 2])
    final_arr.append(np.array([group[0][0], group[0][1], mean]))
print(final_arr)
>>> [array([0. , 1. , 1.5]), array([2., 2., 2.])]
I am looking for a reliable and efficient suggestion. In my real data, the dtype is float.
You can't have a ragged array in NumPy; that's what lists are for. Instead of separating the data into groups, you can identify the locations of the groups and operate on them using the aggregation methods that ufuncs provide. Specifically, you can use np.add.reduceat to compute the means once you identify the indices of the group starts. A group starts wherever either number in the first two columns differs from the previous row (this assumes rows belonging to the same group are contiguous; if they might not be, lexsort the array by the first two columns first):
mask = (arr[:-1, :2] != arr[1:, :2]).any(1)
group_indices = np.flatnonzero(np.r_[True, mask])
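On the sample array from the question, these two lines mark rows 0 and 2 as group starts:

```python
import numpy as np

arr = np.array([[0, 1, 1], [0, 1, 2], [2, 2, 2]], dtype=float)

# True wherever a row's first two columns differ from the previous row's
mask = (arr[:-1, :2] != arr[1:, :2]).any(1)        # [False, True]
# Prepend True so row 0 always starts a group, then get the indices
group_indices = np.flatnonzero(np.r_[True, mask])
print(group_indices)  # → [0 2]
```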
Now taking the mean is straightforward:
sums = np.add.reduceat(arr[:, -1], group_indices, axis=0)
lengths = np.diff(np.r_[group_indices, len(arr)])
means = sums / lengths
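For the sample array, reduceat sums the third column within each group, and dividing by the group lengths yields the per-group means:

```python
import numpy as np

arr = np.array([[0, 1, 1], [0, 1, 2], [2, 2, 2]], dtype=float)
group_indices = np.array([0, 2])  # group starts, computed as above

# Sum the third column over each [start, next start) slice
sums = np.add.reduceat(arr[:, -1], group_indices)   # [3., 2.]
# Group sizes: difference between consecutive starts (and the array end)
lengths = np.diff(np.r_[group_indices, len(arr)])   # [2, 1]
means = sums / lengths                              # [1.5, 2.]
```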
To construct the output, concatenate the first two columns of the array at each group's start index with the means (reshaped into a column):
result = np.concatenate((arr[group_indices, :2], means[:, None]), axis=1)
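Putting the steps together into one self-contained script (assuming, as above, that rows of the same group are contiguous):

```python
import numpy as np

arr = np.array([[0, 1, 1], [0, 1, 2], [2, 2, 2]], dtype=float)

# Group starts: row 0, plus every row whose first two columns changed
mask = (arr[:-1, :2] != arr[1:, :2]).any(1)
group_indices = np.flatnonzero(np.r_[True, mask])

# Per-group mean of the third column via reduceat
sums = np.add.reduceat(arr[:, -1], group_indices)
lengths = np.diff(np.r_[group_indices, len(arr)])
means = sums / lengths

# First two columns of each group's first row, plus the mean as a column
result = np.concatenate((arr[group_indices, :2], means[:, None]), axis=1)
print(result)  # rows: [0., 1., 1.5] and [2., 2., 2.]
```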