Search code examples
pythonrandomsamplingdownsampling

Python: downsample a population using frequency data


Given a data series representing frequencies of elements in a population, what would be the easiest way to downsample it ?

The following population: pop = ['a', 'b', 'a', 'c', 'c', 'd', 'c', 'a', 'a', 'b', 'a']

Can be summeriezed as: freq = {'a': 5, 'c': 3, 'b': 2, 'd': 1}

Using the simple: from collections import Counter; Counter(pop)

To randomly downsample that population to 5 individuals I can do:

>>> from random import sample
>>> from collections import Counter
>>> pop = ['a', 'b', 'a', 'c', 'c', 'd', 'c', 'a', 'a', 'b', 'a']
>>> smaller_pop = sample(pop, 5)
>>> smaller_freq = Counter(smaller_pop)
>>> print smaller_freq
Counter({'a': 3, 'c': 1, 'b': 1})

But I'm searching for a way to do this directly from the freq information without having to build the pop list. You will agree that proceeding like this should not be necessary:

>>> from random import sample
>>> from collections import Counter
>>> flatten = lambda x: [item for sublist in x for item in sublist]
>>> freq = {'a': 5, 'c': 3, 'b': 2, 'd': 1}
>>> pop = flatten([[k]*v for k,v in freq.items()])
>>> smaller_pop = sample(pop, 5)
>>> smaller_freq = Counter(smaller_pop)
>>> print smaller_freq
Counter({'a': 2, 'c': 2, 'd': 1})

For memory considerations and speed requirements, I would like to avoid placing in memory the pop list. This can surely be done using some type of weighted random generator.


Solution

  • Here is a basic algorithm that downsamples frequencies:

    import random
    import bisect
    import collections
    
    def downsample(freq, n):
        cumsums = []
        total = 0
        choices, weights = zip(*freq.items())
        for weight in weights:
            total += weight
            cumsums.append(total)
        assert 0 <= n <= total
        result = collections.Counter()
        for _ in range(n):
            rnd = random.uniform(0, total)
            i = bisect.bisect(cumsums, rnd)
            result[choices[i]] += 1
            cumsums = [c if idx<i else c-1 for idx, c in enumerate(cumsums)]
            total -= 1
        return result
    
    freq = {'a': 5, 'c': 3, 'b': 2, 'd': 1}
    print(downsample(freq, 5))
    

    which prints results like

    Counter({'c': 2, 'a': 1, 'b': 1, 'd': 1})