python grammar python-internals

Why does Python's grammar specification not include docstrings and comments?


I am consulting the official Python grammar specification as of Python 3.6.

I am unable to find any syntax for comments (they appear prepended with a #) or docstrings (they should appear with '''). A quick look at the lexical analysis page didn't help either: docstrings are defined there as longstrings, but longstrings do not appear in the grammar specification. A token named STRING appears further on, but the grammar never refers to its definition.

Given this, I am curious about how the CPython compiler knows what comments and docstrings are. How is this feat accomplished?

I originally guessed that comments and docstrings are removed in a first pass by the CPython compiler, but that raises the question of how help() is able to render the relevant docstrings.


Solution

  • Section 1

    What happens to comments?

    Comments (anything preceded by a #) are ignored during tokenization/lexical analysis, so there is no need to write grammar rules to parse them. They carry no semantic information for the interpreter/compiler; they exist only to make the program more readable for a human, and so they are discarded.

    Here's the lex specification for the ANSI C programming language: http://www.quut.com/c/ANSI-C-grammar-l-1998.html. I'd like to draw your attention to the way comments are being processed here:

    "/*"            { comment(); }
    "//"[^\n]*      { /* consume //-comment */ }
    

    Now, take a look at the rule for int.

    "int"           { count(); return(INT); }
    

    Here's the lex function to process int and other tokens:

    void count(void)
    {
        int i;
    
        for (i = 0; yytext[i] != '\0'; i++)
            if (yytext[i] == '\n')
                column = 0;
            else if (yytext[i] == '\t')
                column += 8 - (column % 8);
            else
                column++;
    
        ECHO;
    }
    

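    count() only updates the column position and echoes the matched text (the ECHO statement). What makes int a token is the return(INT) in its rule: it hands an INT token to the parser, which must then parse it.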

    Now, here's the lex function to process comments:

    void comment(void)
    {
        char c, prev = 0;
    
        while ((c = input()) != 0)      /* (EOF maps to 0) */
        {
            if (c == '/' && prev == '*')
                return;
            prev = c;
        }
        error("unterminated comment");
    }
    

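    There's no ECHO here, and the comment rules contain no return statement either. So no token is handed to the parser: the comment text is simply consumed and discarded.

    This is a representative example; CPython's tokenizer treats comments in the same way, skipping them without producing a token for the parser.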
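
    The same effect can be observed from Python itself. The following is a minimal sketch, not a view into CPython's C tokenizer: it uses the standard library's ast and tokenize modules, and the sample source strings are made up for illustration. It checks that a comment leaves no trace in the syntax tree, even though the pure-Python tokenize module (written for tools that need comments) does report them.

```python
import ast
import io
import tokenize

# Two hypothetical one-line programs that differ only by a trailing comment.
with_comment = "x = 1  # set x\n"
without_comment = "x = 1\n"

# The parser never receives the comment, so both sources yield the same tree.
same_tree = ast.dump(ast.parse(with_comment)) == ast.dump(ast.parse(without_comment))

# The stdlib tokenize module is a separate, pure-Python tokenizer that keeps
# comments as COMMENT tokens so that source-processing tools can see them.
tokens = list(tokenize.generate_tokens(io.StringIO(with_comment).readline))
comment_tokens = [tok.string for tok in tokens if tok.type == tokenize.COMMENT]

print(same_tree)
print(comment_tokens)
```

    So the comment is visible to a tool that asks for it, but nothing downstream of the tokenizer in the compiler pipeline ever depends on it.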


    Section 2

    What happens to docstrings?

    Note: This section of my answer is meant to be a complement to @MartijnPieters' answer. It is not meant to replicate any of the information he has furnished in his post. Now, with that said,...

    I originally guessed that comments and docstrings are removed in a first pass by the CPython compiler[...]

    Docstrings (string literals that are not assigned to any variable name, anything within '...', "...", '''...''', or """...""") are indeed processed. They are parsed as simple string literals (STRING+ token), as Martijn Pieters mentions in his answer. As of the current docs, it is only mentioned in passing that docstrings are assigned to the function/class/module's __doc__ attribute. How it is done is not really mentioned in depth anywhere.

    What actually happens is that they are tokenised and parsed as ordinary string literals, so the resulting parse tree contains them. The byte code is generated from that parse tree, with each docstring ending up in its rightful place in the __doc__ attribute (it does not show up as an explicit instruction in the byte code, as illustrated below). I won't go into details since the answer I linked above describes the same in very nice detail.
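
    As a rough sketch of that path, assuming only the standard ast module and a made-up function foo, the snippet below checks that the docstring is an ordinary expression statement in the tree, and that after compiling and executing the tree it is available through __doc__. Exact whitespace in __doc__ can differ between Python versions, so the sketch strips it before comparing.

```python
import ast

# Hypothetical source used only for this illustration.
source = 'def foo():\n    """ docstring """\n    pass\n'

tree = ast.parse(source)
func = tree.body[0]

# The docstring is just the first statement of the body: an expression
# statement wrapping a string literal.
first_stmt_is_expr = isinstance(func.body[0], ast.Expr)

# ast.get_docstring reads that literal back from the tree (clean=False keeps
# the text as written).
doc_from_tree = ast.get_docstring(func, clean=False)

# Compile the tree and run it; the function object now carries the docstring.
namespace = {}
exec(compile(tree, "<example>", "exec"), namespace)
doc_at_runtime = namespace["foo"].__doc__

print(first_stmt_is_expr, repr(doc_from_tree), repr(doc_at_runtime))
```

    This is also what help() relies on: it reads __doc__ from the live object rather than going back to the source text.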

    Of course, it is possible to have them discarded completely. If you run python -OO (informally "optimize intensely", as opposed to -O, "optimize mildly"), the docstrings are excluded from the resulting byte code. That byte code was cached in .pyo files before Python 3.5; since Python 3.5 it is cached in .opt-2.pyc files instead.

    An illustration can be seen below:

    Create a file test.py with the following code:

    def foo():
        """ docstring """
        pass
    

    Now, we'll compile this code with the normal flags set.

    >>> code = compile(open('test.py').read(), '', 'single')
    >>> import dis
    >>> dis.dis(code)
      1           0 LOAD_CONST               0 (<code object foo at 0x102b20ed0, file "", line 1>)
                  2 LOAD_CONST               1 ('foo')
                  4 MAKE_FUNCTION            0
                  6 STORE_NAME               0 (foo)
                  8 LOAD_CONST               2 (None)
                 10 RETURN_VALUE
    

    As you can see, there is no mention of our docstring in the module-level byte code. However, it is there, stored in the constants of the function's own code object. To get the docstring, you can do...

    >>> code.co_consts[0].co_consts
    (' docstring ', None)
    

    So, as you can see, the docstring does remain, just not as a part of the main bytecode. Now, let's recompile this code, but with the optimisation level set to 2 (equivalent of the -OO switch):

    >>> code = compile(open('test.py').read(), '', 'single', optimize=2)
    >>> dis.dis(code)
      1           0 LOAD_CONST               0 (<code object foo at 0x102a95810, file "", line 1>)
                  2 LOAD_CONST               1 ('foo')
                  4 MAKE_FUNCTION            0
                  6 STORE_NAME               0 (foo)
                  8 LOAD_CONST               2 (None)
                 10 RETURN_VALUE
    

    No difference, but...

    >>> code.co_consts[0].co_consts
    (None,)
    

    The docstrings have gone now.
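
    To tie this back to help(), here is a small sketch of the runtime consequence. It assumes the same function as in test.py, inlined as a string, and a hypothetical helper doc_at_level that compiles and executes it at a given optimisation level, then returns the function's __doc__.

```python
# Same function as in test.py, inlined so the sketch is self-contained.
source = 'def foo():\n    """ docstring """\n    pass\n'

def doc_at_level(level):
    """Compile `source` at the given optimize level and return foo.__doc__."""
    namespace = {}
    exec(compile(source, "<example>", "exec", optimize=level), namespace)
    return namespace["foo"].__doc__

kept = doc_at_level(0)      # no optimisation: the docstring survives
stripped = doc_at_level(2)  # equivalent of -OO: the docstring is discarded

print(repr(kept), repr(stripped))
```

    With optimize=2 the function's __doc__ is None, so help(foo) has no docstring left to render.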

    The -O and -OO flags only remove things; basic optimisation of the byte code is done by default. -O removes assert statements and if __debug__: suites from the generated bytecode, while -OO additionally discards docstrings. Compile time may decrease slightly as a result. Execution speed stays essentially the same unless the code contains a large number of assert statements and if __debug__: suites.
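
    The effect of -O on assert statements can be sketched in the same way. The source string and the run helper below are made up for illustration; optimize=1 is the compile() equivalent of the -O switch.

```python
# Hypothetical two-line program: a failing assert followed by an assignment.
source = "assert False, 'boom'\nresult = 'reached'\n"

def run(level):
    """Execute `source` compiled at the given optimize level."""
    namespace = {}
    try:
        exec(compile(source, "<example>", "exec", optimize=level), namespace)
    except AssertionError:
        return "AssertionError"
    return namespace["result"]

unoptimised = run(0)  # the assert is compiled in and fails
optimised = run(1)    # equivalent of -O: the assert is not compiled at all

print(unoptimised, optimised)
```

    At level 1 the assert never runs because it was removed at compile time, which is why code should not rely on assert for checks that must always happen.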

    Also, do remember that the docstrings are preserved only if they are the first thing in the function/class/module definition. All additional strings are simply dropped during compilation. If you change test.py to the following:

    def foo():
        """ docstring """
    
        """test"""
        pass
    

    And then repeat the same process with optimize=0, this is what is stored in the co_consts attribute upon compilation:

    >>> code.co_consts[0].co_consts
    (' docstring ', None)
    

    Meaning, """test""" has been ignored. It'll interest you to know that this removal happens even without -O or -OO, as part of the base optimisation of the byte code.
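
    A self-contained way to repeat this check, without relying on the position of the function inside co_consts, is sketched below. It assumes the modified test.py is inlined as a string and locates the code object for foo by type; the variable names are illustrative only.

```python
import types

# The modified test.py from above, inlined so the sketch is self-contained.
source = 'def foo():\n    """ docstring """\n\n    """test"""\n    pass\n'

module_code = compile(source, "<example>", "exec", optimize=0)

# Find the code object for foo among the module's constants.
foo_code = next(c for c in module_code.co_consts if isinstance(c, types.CodeType))

# The second bare string literal should not have been kept as a constant.
has_test = "test" in foo_code.co_consts

print(foo_code.co_name, has_test)
```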


    Section 3

    Additional reading

    (You may find these references as interesting as I did.)

    1. What does Python optimization (-O or PYTHONOPTIMIZE) do?

    2. What do the python file extensions, .pyc .pyd .pyo stand for?

    3. Are Python docstrings and comments stored in memory when a module is loaded?

    4. Working with compile()

    5. The dis module

    6. peephole.c (courtesy Martijn) - The source code for CPython's peephole optimiser, which works on the byte code. This is particularly fascinating, if you can understand it!