python python-3.x ternary dictionary-comprehension

Incrementing dict values with dict comprehension

I'm trying to do the following expression with dict comprehension and a ternary operation in python:

for num in ar:
     if num in seen_dict:
         seen_dict[num] += 1
     else:
         seen_dict[num] = 1

I tried this:

seen_dict = { num: seen_dict[num] += 1 for num in ar if num in seen_dict else seen_dict[num] = 1}

and several permutations thereof, but I keep getting syntax errors. Is it possible to do what I want?

UPDATE

This is the correct syntax, but not my dict only comes back with 1's: seen_dict = { num: (seen_dict[num] + 1) if num in seen_dict else 1 for num in ar }

Can someone explain why this doesn't function the same way as the for loop? Thank you.

Solution

Don't. It seems like using dict comprehensions for this should be a good idea, but it's actually a horrible trap. Use collections.Counter:

import counts

seen_dict = collections.Counter(ar)

or if you don't want to do that, then stick with the loop.

The problem with trying to use a dict comprehension is that a dict comprehension has no good way to maintain state or interleave the computation of the values of each key. Each value must be computed in a single expression. In contrast, the best way to solve your counting problem is to make a single pass over ar and update each element's count as you go.

The restrictions of a comprehension lead to horribly inefficient attempts like

seen_dict = {val: ar.count(val) for val in ar}

which makes a number of passes over ar equal to the length of ar, or the slightly more efficient but still horribly suboptimal

seen_dict = {val: ar.count(val) for val in set(ar)}

which only needs to make len(set(ar)) passes, or for people a bit more familiar with the standard library,

from itertools import groupby
seen_dict = {val: sum(1 for _ in group) for val, group in groupby(sorted(ar))}

which at least isn't quadratic time, but is still O(nlogn) for a length-n ar.

If we run a timing of these four snippets with input list(range(10000)):

from collections import Counter
from itertools import groupby
from timeit import timeit

ar = list(range(10000))

print(timeit('Counter(ar)', number=1, globals=globals()))
print(timeit('{val: ar.count(val) for val in ar}', number=1, globals=globals()))
print(timeit('{val: ar.count(val) for val in set(ar)}', number=1, globals=globals()))
print(timeit('{val: sum(1 for _ in group) for val, group in groupby(sorted(ar))}',
             number=1, globals=globals()))

We get the following output:

0.0005530156195163727
1.0503493696451187
1.0463058911263943
0.00422721728682518

Counter finishes in half a millisecond, while the count snippets both take over a second. (The set version seems to have a lower runtime due to some sort of first-run effect slowing down the other version; swapping the order of the set and non-set version usually reverses the relative timing of those versions. The deduplication of set doesn't help in this test, since the input has no duplicates.)

For a longer input, relying on count would be even more prohibitively expensive. Relying on count could easily take days for an input that Counter would still finish in under a second.