Search code examples
pythonlistdictionarypython-3.4pypy

Idiomatic way to process lists and dicts


Recently I've been discovering "the Python way" of processing lists, dicts, sets, etc. To this extent I've modified my function to calculate the first N prime numbers, let's call it the 'Pure Dict' version:

def findprimeuptoG(x):
    n_primes = {}

    for i in range(2, x):

        if i not in n_primes.values():
            n_primes[i]=i

        for k, v in n_primes.items():
            if i > v:
                n_primes[k] += k

    return sorted(n_primes)

This function works as follows:

  1. maintain a list of primes and a list of integer multiples of those same primes in a dict

  2. those multiples should be bigger or equal to some integer i

  3. if a number is not present in the list of integer multiples of existing primes then it must be a prime and is added to the list of primes

  4. increase i by starting with 2 (smallest prime), up to x

  5. return a list of primes

I've rewritten this function several times using lists, sets, but this version seems to be most idiomatic one. It is short and it reads easily.

If anyone would be kind enough to let me know if this could be written more clearly, please comment as I would love to read it.

And now the question: The first version of this function is not so clean and is much more C-like:

def findprimeupto(x):
    primes = []
    n_primes = []

    for i in range(2, x):

        if i not in n_primes:
            primes.append(i)
            n_primes.append(i)

        for j in range(len(primes)):
            if i > n_primes[j]:
                n_primes[j] += primes[j]

    return primes

But this first version is absolutely the fastest when I run it with a pypy compiler:

python3:

Primes up to: 100000

Algo: Clean version       , primes: 9592, time: 102.74523687362671

Algo: Dict, Set           , primes: 9592, time: 58.230621337890625

Algo: **FirstVersion**        , primes: 9592, time: 59.945680379867554

Algo: List of List[1]        , primes: 9592, time: 71.41077852249146

Algo: List of MutableNum  , primes: 9592, time: 292.3777365684509

Algo: **Pure Dict**           , primes: 9592, time: 56.381882667541504

pypy (ver 2.3.1):

Primes up to: 100000

Algo: Clean version       , primes: 9592, time: 29.3849189281

Algo: Dict, Set           , primes: 9592, time: 85.8557658195

Algo: **FirstVersion**        , primes: 9592, time: 1.11557507515

Algo: List of List        , primes: 9592, time: 42.2394959927

Algo: List of MutableNum  , primes: 9592, time: 38.764893055

Algo: **Pure Dict**           , primes: 9592, time: 110.416568995

I understand the performance hit 'Pure Dict' version got was due to the fact I did not use iterators in my loops, still the speedup 'FirstVersion' got was phenomenal.

As most of our code will probably end up being compiled in a production, should we write code in a more C-like fashion to and not idiomatic Python?

EDIT:

to remove any confusion whether I should have used lists instead of dict I'm submitting another version of this function which I call 'Clean version'. This version uses no direct access to the Nth element of a list, instead it iterates over lists in what I believe to be most Pythonistic way (btw this version is most similar to the lisp version of the same code :)

def findprimeuptoB(x):
    primes = []
    n_primes = []

    for i in range(2, x):

        if not (i in n_primes):
            primes.append(i)
            n_primes.append(i)

        new_n_primes = []

        for prime, n_prime in zip(primes, n_primes):
            if i > n_prime:
                new_n_primes.append(prime + n_prime)
            else:
                new_n_primes.append(n_prime)

        n_primes = new_n_primes

    return primes

Solution

  • Yes, if you care about performance, 'First Version' is the way to go. You can see what's going on using cProfile.

    For reference, on pypy 2.5.0, running python -m cProfile -s cumulative x.py with 'First Version' gives me:

    ncalls  tottime  percall  cumtime  percall filename:lineno(function)
         1    0.000    0.000    0.727    0.727 x.py:1(<module>)
         1    0.724    0.724    0.727    0.727 x.py:29(findprimeupto)
     99999    0.002    0.000    0.002    0.000 {len}
     99999    0.001    0.000    0.001    0.000 {range}
     19184    0.001    0.000    0.001    0.000 {method 'append' of 'list' objects}
         1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
    

    OTOH, with 'Pure Dict', I get:

    ncalls  tottime  percall  cumtime  percall filename:lineno(function)
         1    0.000    0.000   16.864   16.864 x.py:1(<module>)
         1    1.441    1.441   16.864   16.864 x.py:1(findprimeuptoG)
     99998   12.376    0.000   12.376    0.000 {method 'items' of 'dict' objects}
     99998    3.047    0.000    3.047    0.000 {method 'values' of 'dict' objects}
         1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
         1    0.000    0.000    0.000    0.000 {len}
         1    0.000    0.000    0.000    0.000 {range}
    

    which shows that most of the time is wasted creating the temporary lists n_primes.items() and n_primes.values().

    Now, there is an easy fix for this: replacing .items() and .values() with their respective iterator versions, .iteritems() and .itervalues(). However, the result is still much slower than the list version, because dicts are more complicated structure than lists, and low-level dict operations are therefore much slower than the equivalent list operations:

    ncalls  tottime  percall  cumtime  percall filename:lineno(function)
         1    0.000    0.000    3.155    3.155 x.py:1(<module>)
         1    3.147    3.147    3.155    3.155 x.py:15(findprimeuptoH)
     99998    0.006    0.000    0.006    0.000 {method 'itervalues' of 'dict' objects}
     99998    0.002    0.000    0.002    0.000 {method 'iteritems' of 'dict' objects}
         1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
         1    0.000    0.000    0.000    0.000 {len}
         1    0.000    0.000    0.000    0.000 {range}
    

    Finally, 'Clean Version' is clearly rather bad, since it creates a new n_primes list on every iteration. Indeed, I time it at 21.795 seconds.

    Conclusion: Creating new containers (lists, dicts, ...) is very slow, avoid it whenever you can. Also, dicts are slower than lists. In this problem, you don't actually need dicts, so you should use lists.