Tags: python, iterator, iterable, python-collections

Python: counting elements in an iterable with a filter


To count the elements in a list, you can use collections.Counter, but what if only some of the elements have to be counted?
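
As a minimal reference point (a tiny sketch with made-up values, not part of the benchmark below), plain Counter usage looks like this:

from collections import Counter

data = [3, 1, 3, 2, 3, 1]
counts = Counter(data)      # counts: {3: 3, 1: 2, 2: 1}
print(counts[3])            # 3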

I've set up this example (note: NumPy is just for convenience; in general the list will contain arbitrary Python objects):

import numpy as np
from collections import Counter, defaultdict

num_samples = 10000000
num_unique = 1000
numbers = np.random.randint(0, num_unique, num_samples)

I would like to count how often a number occurs in this list, but I'm only interested in numbers <= 10.

This is the baseline to beat. The Counter just counts everything, which should produce some overhead.

%%time
counter = Counter(numbers)

CPU times: user 1.38 s, sys: 7.49 ms, total: 1.39 s
Wall time: 1.39 s

Filtering the iterable while counting it doesn't seem to be possible. The following code works, but it is bad style: it goes through the list twice instead of using a single loop:

%%time
numbers = [number for number in numbers if number <= 10]
counter = Counter(numbers)

CPU times: user 1.3 s, sys: 22.1 ms, total: 1.32 s
Wall time: 1.33 s

That speedup is basically negligible. Let's try a single loop:

%%time

counter = defaultdict(int)
for number in numbers:
    if number > 10:
        continue
    counter[number] += 1

CPU times: user 1.99 s, sys: 11.5 ms, total: 2 s
Wall time: 2.01 s

Well, my single loop is much worse. I assume that Counter benefits from a C-based implementation?
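
As far as I can tell (this is an assumption about CPython internals, not something verified here), Counter.update delegates its counting loop to a C helper when one is available; the snippet below only checks whether that helper is importable:

# In CPython, collections.Counter prefers the C helper _count_elements
# from the _collections extension module and falls back to a
# pure-Python loop if the import fails.
try:
    from _collections import _count_elements
    print("C helper available:", _count_elements)
except ImportError:
    print("No C helper; Counter falls back to a pure-Python counting loop")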

The next thing I tried was swapping the list comprehension for a generator expression. In principle this should mean the list is only looped through once, while the generator is consumed by the Counter. The numbers are disappointing, though: it is basically as fast as the vanilla Counter:

%%time
iterator = (number for number in numbers if number <= 10)
counter = Counter(iterator)

CPU times: user 1.38 s, sys: 8.51 ms, total: 1.39 s
Wall time: 1.39 s

At this point I took a step back and re-ran the numbers a few times. The three Counter versions (unfiltered, list comprehension, generator expression) are almost equal in speed. The defaultdict version is consistently much slower.
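
For completeness, here is a sketch of how these variants could be timed reproducibly outside a notebook with timeit; the helper names and the smaller sample size are illustrative choices, not part of the original measurements:

import timeit
from collections import Counter, defaultdict
import numpy as np

numbers = np.random.randint(0, 1000, 1_000_000)   # smaller array, just to keep the runs short

def plain_counter():
    return Counter(numbers)

def filtered_generator():
    return Counter(number for number in numbers if number <= 10)

def filtered_defaultdict():
    counts = defaultdict(int)
    for number in numbers:
        if number <= 10:
            counts[number] += 1
    return counts

for fn in (plain_counter, filtered_generator, filtered_defaultdict):
    print(fn.__name__, timeit.timeit(fn, number=3))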

How can I efficiently count elements in a Python list while filtering them at the same time?


Solution

  • If this is about large NumPy arrays, you are better off taking advantage of vectorized NumPy operations.

    %%time
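    # Boolean mask keeps only the values <= 10; np.unique with
    # return_counts=True then returns the distinct values and how often
    # each one occurs.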
    np.unique(numbers[numbers <= 10], return_counts=True)
    

    Output:

    Wall time: 31.2 ms
    
    (array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10]),
     array([10055, 10090,  9941, 10002,  9994,  9989, 10070,  9859, 10038,
            10028,  9965], dtype=int64))
    

    For comparison, my own timing of your code gave slightly higher times than the ones you report.
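
    If a Counter-like mapping is needed afterwards, the two arrays returned by np.unique can be combined into a plain dict; this is just a small follow-up sketch, not part of the answer above:

    values, counts = np.unique(numbers[numbers <= 10], return_counts=True)
    filtered_counts = dict(zip(values.tolist(), counts.tolist()))
    # filtered_counts[k] is now how often k (for k <= 10) appears in numbers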