Search code examples
pythonsetgeneratorset-comprehension

How do python Set Comprehensions work?


Q1 - Is the following a set() of a generator expression or a set comprehension? (Or are they same? If so, are list & dict comprehensions also corresponding type-cast on generators?)

my_set = {x for x in range(10)}

Q2 - Does the evaluation consider duplicate values & then remove them by applying set()?

dup_set = {x for x in [0, 1, 2, 0, 1, 2]}

Does the comprehension perform (speed-wise) better than regular for loops?

Update - I tried using timeit for speed comparisons. Am not sure if I am being just (fair) about it.

C:\>python -m timeit "s = set()" "for x in range(10):" "
  s.add(x)"
100000 loops, best of 3: 2.3 usec per loop

C:\>python -m timeit "s = {x for x in range(10)}"
1000000 loops, best of 3: 1.68 usec per loop

Now, using some conditionals

C:\>python -m timeit "s = set()" "for x in range(10):" "
  if x%2: s.add(x)"
100000 loops, best of 3: 2.27 usec per loop

C:\>python -m timeit "s = {x for x in range(10) if x%2}"
1000000 loops, best of 3: 1.83 usec per loop

So, there is quite some difference, is it due to the functionality being hardcoded in c?


Solution

  • Q1 : Yes, yes, yes and yes. Or at least they behave like this. It's a little bit different if you're taking a look at the bytecode. Let's disassembly this code (Python 2.7) :

    def list_comp(l):
        return [x+1 for x in l]
    
    def dict_comp(l):
        return {x+1:0 for x in l}
    
    def set_comp(l):
        return {x+1 for x in l}
    
    def generator(l):
        return (x+1 for x in l)
    

    This is what you get:

    Disassembly of list_comp:
      2           0 BUILD_LIST              0
                  3 LOAD_FAST               0 (l)
                  6 GET_ITER            
            >>    7 FOR_ITER               16 (to 26)
                 10 STORE_FAST              1 (x)
                 13 LOAD_FAST               1 (x)
                 16 LOAD_CONST              1 (1)
                 19 BINARY_ADD          
                 20 LIST_APPEND             2
                 23 JUMP_ABSOLUTE           7
            >>   26 RETURN_VALUE
    Disassembly of dict_comp:
      5           0 LOAD_CONST              1 (<code object <dictcomp> at 029DEE30)
                  3 MAKE_FUNCTION           0
                  6 LOAD_FAST               0 (l)
                  9 GET_ITER            
                 10 CALL_FUNCTION           1
                 13 RETURN_VALUE  
    Disassembly of set_comp:
      8           0 LOAD_CONST              1 (<code object <setcomp> at 029DECC8)
                  3 MAKE_FUNCTION           0
                  6 LOAD_FAST               0 (l)
                  9 GET_ITER            
                 10 CALL_FUNCTION           1
                 13 RETURN_VALUE  
    Disassembly of generator:
     11           0 LOAD_CONST              1 (<code object <genexpr> at 02A8FD58)
                  3 MAKE_FUNCTION           0
                  6 LOAD_FAST               0 (l)
                  9 GET_ITER            
                 10 CALL_FUNCTION           1
                 13 RETURN_VALUE                     
    

    The bytecode is barely the same for the dict comprenhension, the set comprehension and the generator. They all load a code object (<dictcomp>, <setcomp> or <genexpr>) and then make a callable function out of it. The list comprehension is different because it generates the bytecode corresponding to your list comprehension. This time it is interpreted and thus not native.

    Q2 : It doesn't really consider duplicate values since it creates a comprehension with the list you gave. And then it creates the set with the comprehension.

    About timing : List/Dict/Set comprehensions tend to be faster than anything else. Even if they're interpreted, the bytecode generated is optimized for most of the cases with special bytecode instructions like SET_ADD, LIST_APPEND or MAP_ADD.