Group by median for Numpy (without Pandas)


Is it possible to calculate a median of one column based on groupings of another column without using pandas (and keeping my data in a Numpy array)?

For example, if this is the input:

arr = np.array([[0,1],[0,2],[0,3],[1,4],[1,5],[1,6]])

I want this as the output (using the first column to group, and then taking the median of the second column):

ans = np.array([[0,2],[1,5]])

Solution

  • If you would rather avoid Pandas, here is one way to do the computation. Note that in general the median is not an integer (unless you round or floor it): for even-sized groups it is the average of the two middle elements. An integer group id and a floating-point median therefore cannot share a single regular array without converting the ids to float, although they can in a structured array.

    import numpy as np
    
    def grouped_median(group, value):
        # Sort by group, then by value within each group
        s = np.lexsort([value, group])
        group2 = group[s]
        value2 = value[s]
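        # Indices where each group starts, plus a final index marking the end
        # (the prepend/append sentinels assume numeric group ids)
        w = np.flatnonzero(np.diff(group2, prepend=group2[0] - 1, append=group2[-1] + 1))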
        # Size of each group
        wd = np.diff(w)
        # Middle positions of each group: m1 is the upper middle element,
        # m2 the lower one (m2 == m1 when the group size is odd)
        m1 = w[:-1] + wd // 2
        m2 = m1 - 1 + (wd % 2)
        # Group id
        group_res = group2[m1]
        # Group median value
        value_res = (value2[m1] + value2[m2]) / 2  # Use `// 2` or round for int result
        return group_res, value_res
    
    # Test
    arr = np.array([[0, 1], [0, 2], [0, 3], [1, 4], [1, 5], [1, 6]])
    group_res, value_res = grouped_median(arr[:, 0], arr[:, 1])
    # Print
    for g, v in zip(group_res, value_res):
        print(g, v)
        # 0 2.0
        # 1 5.0
    # As a structured array
    res = np.empty(group_res.shape, dtype=[('group', group_res.dtype),
                                           ('median', value_res.dtype)])
    res['group'] = group_res
    res['median'] = value_res
    print(res)
    # [(0, 2.) (1, 5.)]
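
If the number of groups is small, a shorter (though likely slower) alternative to the sort-based function above is to let `np.unique` label the groups and call `np.median` once per group. The sketch below is not part of the original answer: `grouped_median_simple` is a hypothetical helper name, and it returns a regular 2-D float array, so the integer group ids are converted to float.

```python
import numpy as np

def grouped_median_simple(arr):
    """Median of column 1 for each distinct value in column 0 (hypothetical helper)."""
    # Sorted unique group ids and, for every row, the index of its group
    groups, inverse = np.unique(arr[:, 0], return_inverse=True)
    # One np.median call per group: simple, but loops in Python
    medians = np.array([np.median(arr[inverse == i, 1])
                        for i in range(len(groups))])
    # Stacking int ids with float medians promotes the whole result to float
    return np.column_stack([groups, medians])

arr = np.array([[0, 1], [0, 2], [0, 3], [1, 4], [1, 5], [1, 6]])
print(grouped_median_simple(arr))
# [[0. 2.]
#  [1. 5.]]
```

For an even-sized group such as the values 1 and 2, the median is 1.5, which is why the result is kept as float rather than cast back to integers.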