Tags: python, numpy, vectorization

Finding the average of the x component of an array of coordinates, based on the y component


I have the following example array of x-y coordinate pairs:

import numpy as np

A = np.array([[0.33703753, 3.],
              [0.90115394, 5.],
              [0.91172016, 5.],
              [0.93230994, 3.],
              [0.08084283, 3.],
              [0.71531777, 2.],
              [0.07880787, 3.],
              [0.03501083, 4.],
              [0.69253184, 4.],
              [0.62214452, 3.],
              [0.26953094, 1.],
              [0.4617873 , 3.],
              [0.6495549 , 0.],
              [0.84531478, 4.],
              [0.08493308, 5.]])

My goal is to reduce this to an array with six rows by taking the average of the x-values for each y-value, like so:

array([[0.6495549 , 0.        ],
       [0.26953094, 1.        ],
       [0.71531777, 2.        ],
       [0.41882167, 3.        ],
       [0.52428582, 4.        ],
       [0.63260239, 5.        ]])

Currently I am achieving this by converting to a pandas dataframe, performing the calculation, and converting back to a numpy array:

>>> import pandas as pd
>>> df = pd.DataFrame({'x': A[:, 0], 'y': A[:, 1]})
>>> df.groupby('y').mean().reset_index()
     y         x
0  0.0  0.649555
1  1.0  0.269531
2  2.0  0.715318
3  3.0  0.418822
4  4.0  0.524286
5  5.0  0.632602
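
(The final conversion back to a numpy array, omitted above, would presumably be something like the following, reordering the columns to match the target:)

>>> df.groupby('y').mean().reset_index()[['x', 'y']].to_numpy()  # yields the 6x2 target array above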

Is there a way to perform this calculation using numpy, without having to resort to the pandas library?


Solution

  • Here's a completely vectorized solution that uses only numpy methods and no Python iteration. First, sort the array by the grouping column, then find each group's unique value, first index, and size:

    sort_indices = np.argsort(A[:, 1])
    unique_y, unique_indices, group_count = np.unique(A[sort_indices, 1], return_index=True, return_counts=True)
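    # unique_y       -> array([0., 1., 2., 3., 4., 5.])
    # unique_indices -> array([ 0,  1,  2,  3,  9, 12])
    # group_count    -> array([1, 1, 1, 6, 3, 3])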
    

    Once we have the first indices and counts of all the unique elements, we can use np.add.reduceat (an np.ufunc method) to sum the rows of each group, and then divide by the group counts to get the means:

    group_sum = np.add.reduceat(A[sort_indices, :], unique_indices, axis=0)
    
    group_mean = group_sum / group_count[:, None]
    # array([[0.6495549 , 0.        ],
    #        [0.26953094, 1.        ],
    #        [0.71531777, 2.        ],
    #        [0.41882167, 3.        ],
    #        [0.52428582, 4.        ],
    #        [0.63260239, 5.        ]])
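
    To see what reduceat does in isolation, here is a small 1-D illustration (my own toy example, not part of the original answer):

    x = np.array([1, 2, 3, 4, 5, 6])
    np.add.reduceat(x, [0, 2, 5])
    # array([ 3, 12,  6])  -> sums of x[0:2], x[2:5], and x[5:]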
    

    Benchmarks:

    I compared this solution with the other answers here (code at tio.run) for two scenarios:

    1. A contains 10k rows, with A[:, 1] containing N groups, where N varies from 1 to 10k.
       [Plot: timing for different methods with 10k rows, N groups]

    2. A contains N rows (N varies from 1 to 10k), with A[:, 1] containing min(N, 1000) groups.
       [Plot: timing for different methods with N rows, 1k groups]

    Observations: The numpy-only solutions (Dani's and mine) win easily: they are significantly faster than the pandas approach, possibly because the time taken to create the dataframe is an overhead the numpy solutions avoid.

    For smaller arrays, the pandas solution is also slower than the python+numpy solutions (Jaimu's and the previous version of this answer, shown below), since iterating directly in Python is cheaper than building a dataframe first; as the array size or the number of groups grows, however, those solutions become much slower than pandas.
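
    For anyone who wants to reproduce the comparison without the tio.run link, here is a minimal timing sketch of mine (groupby_numpy and groupby_pandas are placeholder names standing in for the approaches above, not the exact benchmark code):

    import timeit
    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(0)
    n_rows, n_groups = 10_000, 1_000
    # random x-values, integer-valued y-labels stored as floats (as in A above)
    A = np.column_stack([rng.random(n_rows),
                         rng.integers(0, n_groups, n_rows).astype(float)])

    def groupby_numpy(arr):
        # sort by y, then sum each run of equal y-values and divide by its length
        idx = np.argsort(arr[:, 1])
        _, starts, counts = np.unique(arr[idx, 1], return_index=True, return_counts=True)
        return np.add.reduceat(arr[idx, :], starts, axis=0) / counts[:, None]

    def groupby_pandas(arr):
        df = pd.DataFrame({'x': arr[:, 0], 'y': arr[:, 1]})
        return df.groupby('y').mean().reset_index().to_numpy()

    print(timeit.timeit(lambda: groupby_numpy(A), number=100))
    print(timeit.timeit(lambda: groupby_pandas(A), number=100))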


    Note: The previous version of this answer iterated over the groups as returned by the accepted answer to "Is there any numpy group by function?" and calculated each group's mean individually:

    First, sort the array on the column to group by:

    A_s = A[A[:, 1].argsort(), :]
    

    Then, split the sorted array into per-group subarrays: np.split splits its first argument at the indices given by its second argument.
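
    As a quick illustration of np.split on its own (my toy example, not from the linked answer):

    np.split(np.arange(6), [2, 5])
    # [array([0, 1]), array([2, 3, 4]), array([5])]

    Applied to the sorted array: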

    unique_elems, unique_indices = np.unique(A_s[:, 1], return_index=True) 
    # (array([0., 1., 2., 3., 4., 5.]), array([ 0,  1,  2,  3,  9, 12])) 
    
    split_indices = unique_indices[1:] # No need to split at the first index
    
    groups = np.split(A_s, split_indices)
    # [array([[0.6495549, 0.       ]]),
    #  array([[0.26953094, 1.        ]]),
    #  array([[0.71531777, 2.        ]]),
    #  array([[0.33703753, 3.        ],
    #         [0.93230994, 3.        ],
    #         [0.08084283, 3.        ],
    #         [0.07880787, 3.        ],
    #         [0.62214452, 3.        ],
    #         [0.4617873 , 3.        ]]),
    #  array([[0.03501083, 4.        ],
    #         [0.69253184, 4.        ],
    #         [0.84531478, 4.        ]]),
    #  array([[0.90115394, 5.        ],
    #         [0.91172016, 5.        ],
    #         [0.08493308, 5.        ]])]
    

    Now, groups is a list of np.arrays. Iterate over the list and take the mean of each array:

    means = np.zeros((len(groups), groups[0].shape[1]))
    for i, grp in enumerate(groups):
        means[i, :] = grp.mean(axis=0)
    
    # array([[0.6495549 , 0.        ],
    #        [0.26953094, 1.        ],
    #        [0.71531777, 2.        ],
    #        [0.41882167, 3.        ],
    #        [0.52428582, 4.        ],
    #        [0.63260239, 5.        ]])
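
    The same loop can be written more compactly with a list comprehension (equivalent output, still Python-level iteration):

    means = np.array([grp.mean(axis=0) for grp in groups])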