I'm trying to do the following expression with dict comprehension and a ternary operation in python:
for num in ar:
if num in seen_dict:
seen_dict[num] += 1
else:
seen_dict[num] = 1
I tried this:
seen_dict = { num: seen_dict[num] += 1 for num in ar if num in seen_dict else seen_dict[num] = 1}
and several permutations thereof, but I keep getting syntax errors. Is it possible to do what I want?
UPDATE
This is the correct syntax, but not my dict only comes back with 1's:
seen_dict = { num: (seen_dict[num] + 1) if num in seen_dict else 1 for num in ar }
Can someone explain why this doesn't function the same way as the for loop? Thank you.
Don't. It seems like using dict comprehensions for this should be a good idea, but it's actually a horrible trap. Use collections.Counter
:
import counts
seen_dict = collections.Counter(ar)
or if you don't want to do that, then stick with the loop.
The problem with trying to use a dict comprehension is that a dict comprehension has no good way to maintain state or interleave the computation of the values of each key. Each value must be computed in a single expression. In contrast, the best way to solve your counting problem is to make a single pass over ar
and update each element's count as you go.
The restrictions of a comprehension lead to horribly inefficient attempts like
seen_dict = {val: ar.count(val) for val in ar}
which makes a number of passes over ar
equal to the length of ar
, or the slightly more efficient but still horribly suboptimal
seen_dict = {val: ar.count(val) for val in set(ar)}
which only needs to make len(set(ar))
passes, or for people a bit more familiar with the standard library,
from itertools import groupby
seen_dict = {val: sum(1 for _ in group) for val, group in groupby(sorted(ar))}
which at least isn't quadratic time, but is still O(nlogn) for a length-n ar
.
If we run a timing of these four snippets with input list(range(10000))
:
from collections import Counter
from itertools import groupby
from timeit import timeit
ar = list(range(10000))
print(timeit('Counter(ar)', number=1, globals=globals()))
print(timeit('{val: ar.count(val) for val in ar}', number=1, globals=globals()))
print(timeit('{val: ar.count(val) for val in set(ar)}', number=1, globals=globals()))
print(timeit('{val: sum(1 for _ in group) for val, group in groupby(sorted(ar))}',
number=1, globals=globals()))
We get the following output:
0.0005530156195163727
1.0503493696451187
1.0463058911263943
0.00422721728682518
Counter
finishes in half a millisecond, while the count
snippets both take over a second. (The set
version seems to have a lower runtime due to some sort of first-run effect slowing down the other version; swapping the order of the set
and non-set
version usually reverses the relative timing of those versions. The deduplication of set
doesn't help in this test, since the input has no duplicates.)
For a longer input, relying on count
would be even more prohibitively expensive. Relying on count
could easily take days for an input that Counter
would still finish in under a second.