
Numpy - calculate mean of groups


I have an array of shape (N, 3), for instance:

arr = np.array(
  [
    [0,1,1], 
    [0,1,2], 
    [2,2,2]
  ]
)

I want to group this array by its first two columns to obtain something like this:

grouped_arr = np.array([[[0,1,1], [0,1,2]], [[2,2,2]]])

Finally, I would like to keep only one row per group, with the third column replaced by the mean of the group's third column:

final_array = np.array([[0,1,1.5], [2,2,2]])

I tried the following, but I am not sure whether it is correct or an efficient way to achieve this:

import numpy as np

arr = np.array([[0,1,1], [0,1,2], [2,2,2]])

stacked = np.vstack((arr[:,0], arr[:,1])).transpose()
uniques_values = np.unique(stacked, axis=0)

groups = []
for v in uniques_values:
    groups.append(arr[v])

final_arr = []
for group in groups:
    mean = np.mean(group[:,2], axis=0)
    final_arr.append(np.array([group[0][0], group[0][1], mean]))

print(final_arr)

>>>[array([0. , 1. , 1.5]), array([2., 2., 2.])]

I am looking for a reliable and efficient approach. In my real data the dtype is float.


Solution

  • You can't have a ragged array in numpy. That's what lists are for. Instead of separating the data into groups, you can identify where each group starts and operate on the groups in place using the aggregation methods that ufuncs provide.

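    Specifically, you can use np.add.reduceat to construct the means once you identify the indices where groups start. A group starts at the first row and wherever either number in the first two columns differs from the previous row. This assumes that rows belonging to the same group are adjacent, as they are in your example; if they are not, sort the rows by the first two columns first, for example with np.lexsort: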

    mask = (arr[:-1, :2] != arr[1:, :2]).any(1)
    group_indices = np.flatnonzero(np.r_[True, mask])
    

    Now taking the mean is straightforward:

    sums = np.add.reduceat(arr[:, -1], group_indices, axis=0)
    lengths = np.diff(np.r_[group_indices, len(arr)])
    means = sums / lengths
    

    To construct the output, you just need the first two columns of the array at the start index of each group, concatenated with the means (reshaped into a column):

    result = np.concatenate((arr[group_indices, :2], means[:, None]), axis=1)
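
For reference, the steps above can be collected into one function. This is only a sketch: the name group_means is made up for illustration, and the np.lexsort step is an addition that is not part of the snippets above. It is there so that rows with the same first two columns end up adjacent even when the input is not ordered. It assumes a non-empty 2-D array and matches group keys by exact equality, which matters if your float keys come out of a computation.

```python
import numpy as np

def group_means(arr):
    """Average the last column over groups defined by the first two columns.

    Hypothetical helper; assumes a non-empty 2-D array.
    """
    arr = np.asarray(arr, dtype=float)
    # Added step (assumption): sort by column 0, then column 1, so that
    # equal keys are adjacent. np.lexsort uses the LAST key as primary.
    order = np.lexsort((arr[:, 1], arr[:, 0]))
    arr = arr[order]
    # A group starts at row 0 and wherever either key differs from the row before.
    mask = (arr[:-1, :2] != arr[1:, :2]).any(axis=1)
    group_indices = np.flatnonzero(np.r_[True, mask])
    # Sum the last column within each group, then divide by the group sizes.
    sums = np.add.reduceat(arr[:, -1], group_indices)
    lengths = np.diff(np.r_[group_indices, len(arr)])
    means = sums / lengths
    # One row per group: the two key columns followed by the mean.
    return np.concatenate((arr[group_indices, :2], means[:, None]), axis=1)

# Same rows as in the question, deliberately out of order.
shuffled = np.array([[2, 2, 2], [0, 1, 1], [0, 1, 2]])
print(group_means(shuffled))
```

If your data is already ordered by the first two columns, you can drop the lexsort lines and the result will come back in the original row order.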