Tags: python, numpy, probability, loss-function

Python algorithm for probability that X >= Y where distributions for X and Y are based on 2 arrays of sampled points


I want to estimate the probability that a randomly drawn datapoint from array x will be greater than or equal to a randomly drawn point from array y. I want to do this by comparing all possible pairs of values.

I think that a naive implementation would look like this:

def probability_x_gte_y(array_x, array_y):
    # Count the pairs (vx, vy) with vx >= vy among all possible pairs.
    gte_counter = 0
    n_comparisons = len(array_x) * len(array_y)
    for vx in array_x:
        for vy in array_y:
            if vx >= vy:
                gte_counter += 1
    return gte_counter / n_comparisons
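
For example, with two small hand-checkable arrays (the values below are chosen purely for illustration), 3 of the 6 pairs satisfy vx >= vy, namely (3, 2), (5, 2) and (5, 4):

print(probability_x_gte_y([1, 3, 5], [2, 4]))  # 0.5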

But I believe there is a more efficient way to calculate this, especially given that the distributions of the points in array_x and array_y are likely to be well separated (in other words, the overlap between the two 1-D arrays is likely to be small relative to the total range of values covered).


Solution

  • A much faster implementation is to sort one of the arrays so that, for any given value, the number of elements below it can be found quickly with a binary search. This implementation runs in O(n log n), while the original runs in O(n²).

    import numpy as np

    def probability_x_gte_y_opt2(array_x, array_y):
        n_comparisons = len(array_x) * len(array_y)
        sorted_x = np.sort(array_x)
        # searchsorted counts, for each vy, the values of sorted_x strictly
        # below vy; subtracting these "vx < vy" pairs leaves the "vx >= vy" ones.
        gte_counter = n_comparisons - np.searchsorted(sorted_x, array_y).sum()
        return gte_counter / n_comparisons
    

    On random arrays of size 5000, this is about 3890 times faster on my machine (2.69s vs 0.69ms)!
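
    As a rough sketch of how one might reproduce such a comparison (exact timings will of course vary by machine, and the random test data below is just a placeholder), both functions can be timed with timeit:

    import timeit
    import numpy as np

    rng = np.random.default_rng(0)
    array_x = rng.normal(size=5000)
    array_y = rng.normal(size=5000)

    # One run of each implementation; increase `number` to average over runs.
    t_naive = timeit.timeit(lambda: probability_x_gte_y(array_x, array_y), number=1)
    t_opt = timeit.timeit(lambda: probability_x_gte_y_opt2(array_x, array_y), number=1)
    print(f"naive: {t_naive:.3f} s, optimized: {t_opt * 1e3:.3f} ms")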

    Note that it is possible to write an algorithm running in O(n) time: you can use a radix sort on the two arrays, followed by a kind of custom counting merge between the two sorted arrays. However, Numpy does not implement a radix sort yet, and a fast counting merge cannot easily be implemented with Numpy.
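
    A minimal sketch of what such a counting merge could look like (pure Python rather than Numpy, and the function name here is my own), assuming both inputs have already been sorted, e.g. by a radix sort for integer data:

    def probability_x_gte_y_merge(sorted_x, sorted_y):
        gte_counter = 0
        j = 0
        for vx in sorted_x:
            # Advance j until sorted_y[j] > vx; j is then the number of
            # values in sorted_y that are <= vx, i.e. the pairs won by vx.
            while j < len(sorted_y) and sorted_y[j] <= vx:
                j += 1
            gte_counter += j
        return gte_counter / (len(sorted_x) * len(sorted_y))

    Since j only ever moves forward, the merge itself takes O(len(sorted_x) + len(sorted_y)) time once the inputs are sorted.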