Search code examples
pythonpython-3.xlist-comprehensionpython-internals

What are these extra symbols in a comprehension's symtable?


I'm using symtable to get the symbol tables of a piece of code. Curiously, when using a comprehension (listcomp, setcomp, etc.), there are some extra symbols I didn't define.

Reproduction (using CPython 3.6):

import symtable

root = symtable.symtable('[x for x in y]', '?', 'exec')
# Print symtable of the listcomp
print(root.get_children()[0].get_symbols())

Output:

[<symbol '.0'>, <symbol '_[1]'>, <symbol 'x'>]

Symbol x is expected. But what are .0 and _[1]?

Note that with any other non-comprehension construct I'm getting exactly the identifiers I used in the code. E.g., lambda x: y only results in the symbols [<symbol 'x'>, <symbol 'y'>].

Also, the docs say that symtable.Symbol is...

An entry in a SymbolTable corresponding to an identifier in the source.

...although these identifiers evidently don't appear in the source.


Solution

  • The two names are used to implement list comprehensions as a separate scope, and they have the following meaning:

    • .0 is an implicit argument, used for the iterable (sourced from y in your case).
    • _[1] is a temporary name in the symbol table, used for the target list. This list eventually ends up on the stack.*

    A list comprehension (as well as dict and set comprehension and generator expressions) is executed in a new scope. To achieve this, Python effectively creates a new anonymous function.

    Because it is a function, really, you need to pass in the iterable you are looping over as an argument. This is what .0 is for, it is the first implicit argument (so at index 0). The symbol table you produced explicitly lists .0 as an argument:

    >>> root = symtable.symtable('[x for x in y]', '?', 'exec')
    >>> type(root.get_children()[0])
    <class 'symtable.Function'>
    >>> root.get_children()[0].get_parameters()
    ('.0',)
    

    The first child of your table is a function with one argument named .0.

    A list comprehension also needs to build the output list, and that list could be seen as a local too. This is the _[1] temporary variable. It never actually becomes a named local variable in the code object that is produced; this temporary variable is kept on the stack instead.

    You can see the code object produced when you use compile():

    >>> code_object = compile('[x for x in y]', '?', 'exec')
    >>> code_object
    <code object <module> at 0x11a4f3ed0, file "?", line 1>
    >>> code_object.co_consts[0]
    <code object <listcomp> at 0x11a4ea8a0, file "?", line 1>
    

    So there is an outer code object, and in the constants, is another, nested code object. That latter one is the actual code object for the loop. It uses .0 and x as local variables. It also takes 1 argument; the names for arguments are the first co_argcount values in the co_varnames tuple:

    >>> code_object.co_consts[0].co_varnames
    ('.0', 'x')
    >>> code_object.co_consts[0].co_argcount
    1
    

    So .0 is the argument name here.

    The _[1] temporary variable is handled on the stack, see the disassembly:

    >>> import dis
    >>> dis.dis(code_object.co_consts[0])
      1           0 BUILD_LIST               0
                  2 LOAD_FAST                0 (.0)
            >>    4 FOR_ITER                 8 (to 14)
                  6 STORE_FAST               1 (x)
                  8 LOAD_FAST                1 (x)
                 10 LIST_APPEND              2
                 12 JUMP_ABSOLUTE            4
            >>   14 RETURN_VALUE
    

    Here we see .0 referenced again. _[1] is the BUILD_LIST opcode pushing a list object onto the stack, then .0 is put on the stack for the FOR_ITER opcode to iterate over (the opcode removes the iterable from .0 from the stack again).

    Each iteration result is pushed onto the stack by FOR_ITER, popped again and stored in x with STORE_FAST, then loaded onto the stack again with LOAD_FAST. Finally LIST_APPEND takes the top element from the stack, and adds it to the list referenced by the next element on the stack, so to _[1].

    JUMP_ABSOLUTE then brings us back to the top of the loop, where we continue to iterate until the iterable is done. Finally, RETURN_VALUE returns the top of the stack, again _[1], to the caller.

    The outer code object does the work of loading the nested code object and calling it as a function:

    >>> dis.dis(code_object)
      1           0 LOAD_CONST               0 (<code object <listcomp> at 0x11a4ea8a0, file "?", line 1>)
                  2 LOAD_CONST               1 ('<listcomp>')
                  4 MAKE_FUNCTION            0
                  6 LOAD_NAME                0 (y)
                  8 GET_ITER
                 10 CALL_FUNCTION            1
                 12 POP_TOP
                 14 LOAD_CONST               2 (None)
                 16 RETURN_VALUE
    

    So this makes a function object, with the function named <listcomp> (helpful for tracebacks), loads y, produces an iterator for it (the moral equivalent of iter(y), and calls the function with that iterator as the argument.

    If you wanted to translate that to Psuedo-code, it would look like:

    def <listcomp>(.0):
        _[1] = []
        for x in .0:
            _[1].append(x)
        return _[1]
    
    <listcomp>(iter(y))
    

    The _[1] temporary variable is of course not needed for generator expressions:

    >>> symtable.symtable('(x for x in y)', '?', 'exec').get_children()[0].get_symbols()
    [<symbol '.0'>, <symbol 'x'>]
    

    Instead of appending to a list, the generator expression function object yields the values:

    >>> dis.dis(compile('(x for x in y)', '?', 'exec').co_consts[0])
      1           0 LOAD_FAST                0 (.0)
            >>    2 FOR_ITER                10 (to 14)
                  4 STORE_FAST               1 (x)
                  6 LOAD_FAST                1 (x)
                  8 YIELD_VALUE
                 10 POP_TOP
                 12 JUMP_ABSOLUTE            2
            >>   14 LOAD_CONST               0 (None)
                 16 RETURN_VALUE
    

    Together with the outer bytecode, the generator expression is equivalent to:

    def <genexpr>(.0):
        for x in .0:
            yield x
    
    <genexpr>(iter(y))
    

    * The temporary variable is actually no longer needed; they were used in the initial implementation of comprehensions, but this commit from April 2007 moved the compiler to just using the stack, and this has been the norm for all of the 3.x releases, as well as Python 2.7. It still is easier to think of the generated name as a reference to the stack. Because the variable is no longer needed, I filed issue 32836 to have it removed, and Python 3.8 and onwards will no longer include it in the symbol table.

    In Python 2.6, you can still see the actual temporary name in the disassembly:

    >>> import dis
    >>> dis.dis(compile('[x for x in y]', '?', 'exec'))
      1           0 BUILD_LIST               0
                  3 DUP_TOP
                  4 STORE_NAME               0 (_[1])
                  7 LOAD_NAME                1 (y)
                 10 GET_ITER
            >>   11 FOR_ITER                13 (to 27)
                 14 STORE_NAME               2 (x)
                 17 LOAD_NAME                0 (_[1])
                 20 LOAD_NAME                2 (x)
                 23 LIST_APPEND
                 24 JUMP_ABSOLUTE           11
            >>   27 DELETE_NAME              0 (_[1])
                 30 POP_TOP
                 31 LOAD_CONST               0 (None)
                 34 RETURN_VALUE
    

    Note how the name actually has to be deleted again!