Search code examples
pythonnumpyperformancefor-looprank

Writing group ranking without for-loops in Python + NumPy


My goal is two create a function that takes in two input arrays and outputs ranks withing groups. More specifically, the first input to the function is an array of scores (scores, and the second input array is an array of group sizes (groups, in such a way that the sum of the group sizes equals the total amount of scores, and the scores array can be partitioned into groups doing something like np.split(scores, np.cumsum(groups)). The output ranking should be 1 for the point with the highest score within its group, two for the second highest, and so on. I have implemented this as follows (including two test cases):

import numpy as np


def assign_ranks(scores, groups):
    # Create an array to store the ranks
    ranks = np.zeros(len(scores), dtype=int)

    # Iterate through groups
    start = 0
    for group_size in groups:
        end = start + group_size
        ranks[start:end] = (-scores[start:end]).argsort().argsort() + 1
        start = end

    return ranks


def test():
    # Test case 1
    scores1 = np.array([0.1, 0.5, 0.3, 0.4, 0.5])
    groups1 = np.array([3, 2])
    result1 = assign_ranks(scores1, groups1)
    print("Test case 1 result:", result1)  # Expected: [3 1 2 2 1]
    assert np.all(result1 == np.array([3, 1, 2, 2, 1]))

    # Test case 2
    scores2 = np.array([0.7, 0.3, 0.6, 0.1, 0.2, 0.5, 0.8, 0.9, 0.4])
    groups2 = np.array([4, 3, 2])
    result2 = assign_ranks(scores2, groups2)
    print("Test case 2 result:", result2)  # Expected: [1 3 2 4 3 2 1 1 2]
    assert np.all(result2 == np.array([1, 3, 2, 4, 3, 2, 1, 1, 2]))


if __name__ == "__main__":
    test()

While this works correctly, I was wondering if it would be possible to implement this without using for-loops, potentially improving performance. I have made several attemps with no success. Any suggestions?

I have tried other approaches, leveraging np.lexsort and np.split, for example. Nothing worked.


Solution

  • If all your values are positive you can offset them by a base value corresponding to a multiple of the group index they are in. Then use argsort to obtain the global positions of items with the offsets. This will keep the members of a given group together relative to the group index. Then convert the global indexes back into group relative positions and invert the order to obtain ranks:

    def assign_ranks(scores,groups):
        offsets   = np.repeat(np.arange(groups.size),groups) * np.max(scores)
        order     = np.argsort(scores + offsets)
        groupBase = np.repeat(np.cumsum(groups),groups)
        return groupBase - order 
    

    If you have negative values, you'll need to compute a larger offset (than * np.max(scores)) to ensure that there is no overlap across groups in the resulting values.

    You could do this using * 2 * np.max(np.abs(scores))

    or * (np.max(scores) - np.min(scores))