Given a list
of numbers (int
s or float
s, usually float
s), and a number step
, what is the best way to turn the list
into a Counter
, so that the keys are two-element tuple
s, each a multiple of step
, and the values are the counts of numbers that fall into the range greater than the first number and less than the second number for the respective key?
I have written a minimal reproducible example, to illustrate what I intend to achieve. I already found a solution, but I don't think it is the most efficient way.
import random
from bisect import bisect
from collections import Counter
def get_sample(n=256, high=20):
return [random.random()*high for _ in range(n)]
sample = get_sample()
def analyze_sample(sample, step):
ceiling = round(max(sample) / step) + 1
steps = [i*step for i in range(ceiling)]
bands = [(a, b) for a, b in zip(steps, steps[1:])]
result = Counter()
for n in sample:
result[bands[bisect(steps, n)-1]] += 1
return Counter({k: v for k, v in sorted(result.items())})
analyze_sample(sample, 0.5)
What is a better way?
I just turned my code into a linear search, it is much faster, but still slow.
def linear_analyze_sample(sample, step):
sample = sorted(sample)
ceiling = round(sample[-1] / step) + 1
steps = [i*step for i in range(ceiling)]
bands = [(a, b) for a, b in zip(steps, steps[1:])]
result = dict()
i = 0
it = iter(sample)
n = next(it)
for step in steps[1:]:
count = 0
if n == 1e309:
break
while n <= step:
count += 1
n = next(it, 1e309)
result[bands[i]] = count
i += 1
return {k: v for k, v in sorted(result.items())}
In [120]: linear_analyze_sample(sample, 0.5) == analyze_sample(sample, 0.5)
Out[120]: True
In [121]: %timeit analyze_sample(sample, 0.5)
179 µs ± 2.25 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
In [122]: %timeit linear_analyze_sample(sample, 0.5)
62.5 µs ± 380 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
You could compute the ranges directly using the floor of the division by step to get a chunk number, and multiplying it back by step to get the range start and stop values:
result = Counter( (step*int(n/step),step*int(n/step)+step) for n in sample)
In order to minimize the number of redundant computations, you may want to build the counter with chunk numbers first, and convert them to ranges at the end:
chunks = Counter( int(n/step) for n in sample)
result = { (k*step,(k+1)*step):c for k,c in chunks.items() }