Search code examples
pythonpython-3.xgrouping

Python turn a list of numbers into a Counter of ranges


Given a list of numbers (ints or floats, usually floats), and a number step, what is the best way to turn the list into a Counter, so that the keys are two-element tuples, each a multiple of step, and the values are the counts of numbers that fall into the range greater than the first number and less than the second number for the respective key?

I have written a minimal reproducible example, to illustrate what I intend to achieve. I already found a solution, but I don't think it is the most efficient way.

import random
from bisect import bisect
from collections import Counter

def get_sample(n=256, high=20):
    return [random.random()*high for _ in range(n)]

sample = get_sample()

def analyze_sample(sample, step):
    ceiling = round(max(sample) / step) + 1
    steps = [i*step for i in range(ceiling)]
    bands = [(a, b) for a, b in zip(steps, steps[1:])]
    result = Counter()
    for n in sample:
        result[bands[bisect(steps, n)-1]] += 1
    
    return Counter({k: v for k, v in sorted(result.items())})

analyze_sample(sample, 0.5)

What is a better way?


Update

I just turned my code into a linear search, it is much faster, but still slow.

def linear_analyze_sample(sample, step):
    sample = sorted(sample)
    ceiling = round(sample[-1] / step) + 1
    steps = [i*step for i in range(ceiling)]
    bands = [(a, b) for a, b in zip(steps, steps[1:])]
    result = dict()
    i = 0
    it = iter(sample)
    n = next(it)
    for step in steps[1:]:
        count = 0
        if n == 1e309:
            break
        while n <= step:
            count += 1
            n = next(it, 1e309)
        result[bands[i]] = count
        i += 1
    
    return {k: v for k, v in sorted(result.items())}
In [120]: linear_analyze_sample(sample, 0.5) == analyze_sample(sample, 0.5)
Out[120]: True

In [121]: %timeit analyze_sample(sample, 0.5)
179 µs ± 2.25 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

In [122]: %timeit linear_analyze_sample(sample, 0.5)
62.5 µs ± 380 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

Solution

  • You could compute the ranges directly using the floor of the division by step to get a chunk number, and multiplying it back by step to get the range start and stop values:

    result = Counter( (step*int(n/step),step*int(n/step)+step) for n in sample)
    

    In order to minimize the number of redundant computations, you may want to build the counter with chunk numbers first, and convert them to ranges at the end:

    chunks = Counter( int(n/step) for n in sample)
    result = { (k*step,(k+1)*step):c for k,c in chunks.items() }