
Weighted random choice from a variable length text file


I have a series of text files of various (and often changing) lengths. The data stored in the file is in a particular order from most common (top) to least (bottom). I want to randomly select a line from the file weighted towards the top entries - for example, if there are 322 entries in a file, line 1 would be 322 times more likely to be chosen than line 322.

I've been appending the contents of the file to a list to get the length via the len function and then approaching it as a math problem, but I'm wondering (hoping) Python has a cleverer way to accomplish this?


Solution

  • The accepted answer does not appear to match the OP's requirements as written (though it may in practice), so here is another answer that tackles the general problem of randomly selecting a line from a file with weighted probabilities. The approach is based on the random module examples in the Python 3 documentation.

    In this case, line 1 of the file should be selected with the greatest probability, the last line with the least, and intervening lines with steadily decreasing probability. Our weights are therefore range(n, 0, -1), where n is the number of lines in the file. For example, with 5 lines the weights would be [5, 4, 3, 2, 1], corresponding to these probabilities:

    >>> weights = range(5, 0, -1)
    >>> total_weights = float(sum(weights))
    >>> probabilities = [w / total_weights for w in weights]
    >>> [round(p, 5) for p in probabilities]    # rounded for readability
    [0.33333, 0.26667, 0.2, 0.13333, 0.06667]
    

    So the first line has probability 5 times greater than the last line, with reducing probability for each line.
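As a quick sanity check, the same probabilities can be computed as exact fractions, confirming the 5:1 ratio between the first and last line (this snippet is illustrative only and not part of the original answer):

```python
from fractions import Fraction

weights = range(5, 0, -1)
total = sum(weights)                        # 15
probs = [Fraction(w, total) for w in weights]

assert probs[0] / probs[-1] == 5            # first line is 5x more likely than last
assert sum(probs) == 1                      # a valid probability distribution
```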

    Next we need to construct a cumulative distribution based on the weights, select a random value within that distribution, locate the random value within the distribution, and use that to retrieve a line from the file. Here is some code that does that.

    import bisect
    import random
    try:
        from itertools import accumulate     # Python >= 3.2
    except ImportError:
        def accumulate(weights):
            accumulator = 0
            for w in weights:
                accumulator += w
                yield accumulator
    
    def count(iterable):
        return sum(1 for elem in iterable)
    
    def get_nth(iterable, n):
        assert isinstance(n, int), "n must be an integer, got %r" % type(n)
        assert n > 0, "n must be greater than 0, got %r" % n
        for i, elem in enumerate(iterable, 1):
            if i == n:
                return elem
    
    def weighted_select(filename):
        with open(filename) as f:
            n = count(f)
            if n == 0:
                return None
    
            # set up cumulative distribution
            weights = range(n, 0, -1)
            cumulative_dist = list(accumulate(weights))
    
            # select a line number: x falls uniformly within the total weight,
            # and bisect finds the (0-based) slot of the distribution it lands in
            x = random.random() * cumulative_dist[-1]
            selected_line = bisect.bisect(cumulative_dist, x)
    
            # retrieve line from file
            f.seek(0)
            return get_nth(f, selected_line + 1)    # N.B. +1 for nth line
    

    This uses weights according to my interpretation of the question. It's easy enough to adapt this to other weights, e.g. if you wanted a weighted select with city population as the weights, you'd just change weights = range(n, 0, -1) to a list of populations corresponding to each line in the file.