Tags: python, parsing, grammar, tokenize

How to edit parse trees *before* the abstract syntax tree?


I want to understand how to effectively use the stdlib parser module, since ast.parse sometimes loses too much information: it eats whitespace, comments, extra parentheses, and other details that are relevant to source code formatters, for example.

>>> parser.expr('(*x,)').tolist()
[258,
 [332,
  [306,
   [310,
    [311,
     [312,
      [313,
       [316,
        [317,
         [318,
          [319,
           [320,
            [321,
             [322,
              [323,
               [324,
                [325,
                 [7, '('],
                 [326,
                  [315,
                   [16, '*'],
                   [316,
                    [317,
                     [318,
                      [319,
                       [320, [321, [322, [323, [324, [325, [1, 'x']]]]]]]]]]]],
                  [12, ',']],
                 [8, ')']]]]]]]]]]]]]]]]],
 [4, ''],
 [0, '']]

What are all those numbers about, and how do they relate to the grammar? How should I interpret the structure and nesting of this parse tree? Is there a way to pretty-print it with indentation and symbol names instead of numeric codes?


Solution

  • You ask one question in your title and several different questions in the body. This answer mostly addresses the questions in the body, since I'm not sure what kind of pre-AST transformation you're seeking.

    I suspect the answer to your titular question is much more complicated. The parser module doesn't preserve formatting or comments either, for example (although it does identify line/column numbers for tokens, if requested, from which you can derive the horizontal formatting of non-comment lines). If you want to write a syntax-colouring viewer, you're going to want to use the tokenize module. That module doesn't parse, so if you want a parse tree as well, you'll have to use either parser or ast and then correlate their output with the token stream returned by tokenize.
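    A minimal sketch of what tokenize gives you that ast.parse discards: every token, including comments, with its exact text and (row, column) span. (Note the old parser module was removed in Python 3.10; tokenize is still available.)

```python
import io
import tokenize

source = "x = 1  # set x\n"
# generate_tokens wants a readline callable, so wrap the string in StringIO.
toks = list(tokenize.generate_tokens(io.StringIO(source).readline))
for tok in toks:
    # Each TokenInfo carries a type, the exact source text, and start/end positions.
    print(tokenize.tok_name[tok.type], repr(tok.string), tok.start, tok.end)
```

    The COMMENT token survives here with its text and position intact, which is exactly what a formatter or colouriser needs.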

    What are all those numbers about, and how do they relate to the grammar?

    They either identify non-terminals (grammar productions) or terminals (tokens). The numeric ranges for these two categories don't overlap, so there is no confusion.
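    The boundary between the two ranges is exposed by the token module, which still exists in modern Python. A quick sketch (1 and 258 are taken from the tree above):

```python
import token

print(token.NT_OFFSET)           # the boundary between the two numeric ranges
print(token.tok_name[1])         # 1 is the NAME terminal
print(token.ISTERMINAL(1))       # True: below NT_OFFSET
print(token.ISNONTERMINAL(258))  # True: 258 was eval_input, a non-terminal
```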

    How should you interpret the structure and nesting of this parse tree?

    It represents the complete, unedited (as far as I know) syntax tree starting from the root production (either eval_input or file_input, depending on whether you called parser.expr or parser.suite). Terminal nodes start with a terminal index number, followed by the textual representation of the token, followed by location information if requested. Non-terminal nodes start with a non-terminal index number, and are followed by the child nodes. (Apparently there is always at least one child node; the Python grammar doesn't have nullable non-terminals.)

    Note that nesting is very deep on most Python syntax trees because the grammar has a lot of unit productions. Part of what the ast module does is condense out chains of unit productions.
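    You can see that condensation directly: for the same conditional expression parsed below, ast collapses the long chain of unit productions into a single IfExp node with three children.

```python
import ast

# mode="eval" corresponds to parser.expr: parse a single expression.
tree = ast.parse("a if True else b", mode="eval")
print(ast.dump(tree))
```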

    Is there a way to pretty-print it with indentation and symbol names instead of codes?

    Sure:

    import parser
    import symbol
    import token
    from pprint import pprint
    
    def symbolic(root):
      if root[0] in token.tok_name:
        return [token.tok_name[root[0]], *root[1:]]
      elif root[0] in symbol.sym_name:
        return [symbol.sym_name[root[0]], *map(symbolic, root[1:])]
      else:
        # Not optimal since it doesn't symbolise children, if any.
        # But it should never happen, anyway.
        return root
    
    >>> pprint(symbolic(parser.expr("a if True else b").tolist()))
    ['eval_input',
     ['testlist',
      ['test',
       ['or_test',
        ['and_test',
         ['not_test',
          ['comparison',
           ['expr',
            ['xor_expr',
             ['and_expr',
              ['shift_expr',
               ['arith_expr',
                ['term',
                 ['factor',
                  ['power', ['atom_expr', ['atom', ['NAME', 'a']]]]]]]]]]]]]]],
       ['NAME', 'if'],
       ['or_test',
        ['and_test',
         ['not_test',
          ['comparison',
           ['expr',
            ['xor_expr',
             ['and_expr',
              ['shift_expr',
               ['arith_expr',
                ['term',
                 ['factor',
                  ['power', ['atom_expr', ['atom', ['NAME', 'True']]]]]]]]]]]]]]],
       ['NAME', 'else'],
       ['test',
        ['or_test',
         ['and_test',
          ['not_test',
           ['comparison',
            ['expr',
             ['xor_expr',
              ['and_expr',
               ['shift_expr',
                ['arith_expr',
                 ['term',
                  ['factor',
                   ['power', ['atom_expr', ['atom', ['NAME', 'b']]]]]]]]]]]]]]]]]],
     ['NEWLINE', ''],
     ['ENDMARKER', '']]
    

    As an example of how this tree lines up with the grammar, the subtree headed by test in the third line of the above output corresponds to the grammar production

    test: or_test ['if' or_test 'else' test]
    

    Here's an example of a file_input parse:

    >>> pprint(symbolic(parser.suite("import sys").tolist()))
    ['file_input',
     ['stmt',
      ['simple_stmt',
       ['small_stmt',
        ['import_stmt',
         ['import_name',
          ['NAME', 'import'],
          ['dotted_as_names',
           ['dotted_as_name', ['dotted_name', ['NAME', 'sys']]]]]]],
       ['NEWLINE', '']]],
     ['NEWLINE', ''],
     ['ENDMARKER', '']]
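    For comparison, a sketch of what the ast module produces for the same statement: the stmt → simple_stmt → small_stmt → import_stmt chain condenses into a single Import node inside the Module body.

```python
import ast

# The default mode="exec" corresponds to parser.suite: parse a module.
mod = ast.parse("import sys")
print(ast.dump(mod))
```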