
Group and average NumPy matrix


Say I have an arbitrary NumPy array that looks like this:

arr = [[  6.0   12.0   1.0]
       [  7.0   9.0   1.0]
       [  8.0   7.0   1.0]
       [  4.0   3.0   2.0]
       [  6.0   1.0   2.0]
       [  2.0   5.0   2.0]
       [  9.0   4.0   3.0]
       [  2.0   1.0   4.0]
       [  8.0   4.0   4.0]
       [  3.0   5.0   4.0]]

What would be an efficient way of averaging the rows, grouped by the value in their third column?

The expected output would be:

result = [[  7.0  9.33  1.0]
          [  4.0  3.0  2.0]
          [  9.0  4.0  3.0]
          [  4.33  3.33  4.0]]

Solution

  • You can do:

    import numpy as np

    results = []  # one [mean_col0, mean_col1, group] entry per group
    for x in sorted(np.unique(arr[...,2])):
        results.append([np.average(arr[np.where(arr[...,2]==x)][...,0]),
                        np.average(arr[np.where(arr[...,2]==x)][...,1]),
                        x])
    

    Testing:

    >>> arr
    array([[  6.,  12.,   1.],
           [  7.,   9.,   1.],
           [  8.,   7.,   1.],
           [  4.,   3.,   2.],
           [  6.,   1.,   2.],
           [  2.,   5.,   2.],
           [  9.,   4.,   3.],
           [  2.,   1.,   4.],
           [  8.,   4.,   4.],
           [  3.,   5.,   4.]])
    >>> results=[]
    >>> for x in sorted(np.unique(arr[...,2])):
    ...     results.append([np.average(arr[np.where(arr[...,2]==x)][...,0]), 
    ...                     np.average(arr[np.where(arr[...,2]==x)][...,1]),
    ...                     x])
    ... 
    >>> results
    [[7.0, 9.3333333333333339, 1.0], [4.0, 3.0, 2.0], [9.0, 4.0, 3.0], [4.333333333333333, 3.3333333333333335, 4.0]]
    

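    The array arr does not need to be sorted. Note that only the column selection arr[...,2] is a view; indexing with the result of np.where (like any index-array or boolean-mask selection) returns a copy, so the rows of each group are copied before being averaged. For small arrays this is fine, but the loop repeats the selection once per averaged column and per group.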
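Whether an intermediate array is a view or a copy can be checked with np.shares_memory. The following is a minimal sketch on a shortened version of the array from the question; the names col and rows are only illustrative.

```python
import numpy as np

# Shortened version of the array from the question
arr = np.array([[6.0, 12.0, 1.0],
                [7.0, 9.0, 1.0],
                [4.0, 3.0, 2.0]])

col = arr[..., 2]                 # basic indexing: a view into arr
rows = arr[np.where(col == 1.0)]  # index-array selection: a new array

print(np.shares_memory(arr, col))   # True
print(np.shares_memory(arr, rows))  # False
```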

    Or, for a pure NumPy solution without a Python-level loop:

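    import numpy as np

    groups = arr[:,2].copy()

    # Sort the group labels so the rows of each group become contiguous
    _ndx = np.argsort(groups)
    # Unique labels, start offset of each group, and size of each group
    _id, _pos, grp_count  = np.unique(groups[_ndx],
                    return_index=True,
                    return_counts=True)

    # Sum each contiguous block of rows, then divide by the group size
    grp_sum = np.add.reduceat(arr[_ndx], _pos, axis=0)
    grp_mean = grp_sum / grp_count[:,None]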
    
    >>> grp_mean
    array([[7.        , 9.33333333, 1.        ],
           [4.        , 3.        , 2.        ],
           [9.        , 4.        , 3.        ],
           [4.33333333, 3.33333333, 4.        ]])
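As a sketch of how the steps above fit together, they can be wrapped in a small helper and checked against input whose rows are not in group order. The function name group_mean and its key_col parameter are hypothetical and not part of NumPy; the stable sort is an optional choice, not a requirement of the approach.

```python
import numpy as np

def group_mean(arr, key_col=2):
    """Hypothetical helper: average the rows of arr grouped by arr[:, key_col]."""
    # Reorder rows so that each group forms one contiguous block
    order = np.argsort(arr[:, key_col], kind="stable")
    sorted_arr = arr[order]
    # Start offset and size of each block
    _, starts, counts = np.unique(sorted_arr[:, key_col],
                                  return_index=True,
                                  return_counts=True)
    # Sum every block along axis 0, then divide by the block size
    sums = np.add.reduceat(sorted_arr, starts, axis=0)
    return sums / counts[:, None]

arr = np.array([[6.0, 12.0, 1.0], [7.0, 9.0, 1.0], [8.0, 7.0, 1.0],
                [4.0, 3.0, 2.0], [6.0, 1.0, 2.0], [2.0, 5.0, 2.0],
                [9.0, 4.0, 3.0],
                [2.0, 1.0, 4.0], [8.0, 4.0, 4.0], [3.0, 5.0, 4.0]])

means = group_mean(arr)
reversed_means = group_mean(arr[::-1])  # same rows, reversed order
print(np.allclose(means, reversed_means))  # True
```

Because the key column is averaged along with the others, it comes back unchanged in each output row (the mean of identical values is that value).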