I want to understand how to effectively use the stdlib parser module, since ast.parse sometimes loses too much information (it eats whitespace, comments, redundant parentheses and so on, details which are relevant to source-code formatters, for example).
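For example, these two parses come out identical, so the parentheses and the comment are unrecoverable from the AST (a small sketch of the problem):

```python
import ast

# Comments, extra whitespace and redundant parentheses all vanish:
# both sources produce Module(body=[Expr(value=Name(id='x', ...))]).
with_extras = ast.dump(ast.parse("( x )  # important comment"))
bare = ast.dump(ast.parse("x"))
print(with_extras == bare)  # True
```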
>>> parser.expr('(*x,)').tolist()
[258,
[332,
[306,
[310,
[311,
[312,
[313,
[316,
[317,
[318,
[319,
[320,
[321,
[322,
[323,
[324,
[325,
[7, '('],
[326,
[315,
[16, '*'],
[316,
[317,
[318,
[319,
[320, [321, [322, [323, [324, [325, [1, 'x']]]]]]]]]]]],
[12, ',']],
[8, ')']]]]]]]]]]]]]]]]],
[4, ''],
[0, '']]
What are all those numbers about and how do they relate to the grammar? How should you interpret the structure and nesting of this parse tree? Is there a way to pretty-print it with indentation and symbol names instead of codes?
You ask one question in your title and some different questions in the body. This answer mostly addresses the question in the body, since I'm not sure what kind of pre-AST transformation you're seeking.
I suspect the answer to your titular question is much more complicated. The parser module doesn't preserve formatting or comments either, for example (although it does identify line/column numbers for tokens, if requested, from which you can derive the horizontal formatting of non-comment lines). If you want to write a syntax-colouring tool, you're going to want to use the tokenize module. That module doesn't parse, so if you want a parse tree as well, you'll have to use either parser or ast and then correlate the parse tree with the token stream returned by tokenize.
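As a sketch of the tokenize side: it does report comments and exact positions, which you could line up with the terminal tokens of a parse tree (a toy example, not a full correlator):

```python
import io
import tokenize

src = "x = 1  # keep me\n"

# generate_tokens yields TokenInfo tuples (type, string, start, end, line),
# including COMMENT and NL tokens that the parse tree never sees.
toks = list(tokenize.generate_tokens(io.StringIO(src).readline))
comments = [t.string for t in toks if t.type == tokenize.COMMENT]
print(comments)  # ['# keep me']
```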
What are all those numbers about and how do they relate to the grammar?
They either identify non-terminals (grammar productions) or terminals (tokens). The numeric ranges for these two categories don't overlap, so there is no confusion.
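You can check which side of the split a given code falls on with the helpers in the token module (a quick sketch; the cut-off is a CPython detail exposed as token.NT_OFFSET):

```python
import token

# Terminals live below token.NT_OFFSET; non-terminals at or above it.
print(token.NT_OFFSET)           # 256
print(token.ISTERMINAL(1))       # True  (1 is NAME)
print(token.ISNONTERMINAL(258))  # True  (258 is eval_input, the tree's root)
print(token.tok_name[1])         # 'NAME'
```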
How should you interpret the structure and nesting of this parse tree?
It represents the complete, unedited (as far as I know) syntax tree starting from the root production (either eval_input or file_input, depending on whether you called parser.expr or parser.suite). Terminal nodes start with a terminal index number, followed by the textual representation of the token, followed by location information if requested. Non-terminal nodes start with a non-terminal index number, and are followed by the child nodes. (Apparently there is always at least one child node; the Python grammar doesn't have nullable non-terminals.)
Note that nesting is very deep on most Python syntax trees because the grammar has a lot of unit productions. Part of what the ast module does is condense out chains of unit productions.
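As a sketch of that condensation, the same conditional expression parsed with ast collapses the whole test / or_test / ... / atom chain into a single IfExp node:

```python
import ast

# The deep chain of unit productions becomes one IfExp node whose
# children are just the three operand expressions.
tree = ast.parse("a if True else b", mode="eval")
print(type(tree.body).__name__)       # IfExp
print(type(tree.body.test).__name__)  # Constant  (the True)
```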
Is there a way to pretty-print it with indentation and symbol names instead of codes?
Sure:
import parser
import symbol
import token
from pprint import pprint

def symbolic(root):
    if root[0] in token.tok_name:
        return [token.tok_name[root[0]], *root[1:]]
    elif root[0] in symbol.sym_name:
        return [symbol.sym_name[root[0]], *map(symbolic, root[1:])]
    else:
        # Not optimal since it doesn't symbolise children, if any.
        # But it should never happen, anyway.
        return root
>>> pprint(symbolic(parser.expr("a if True else b").tolist()))
['eval_input',
['testlist',
['test',
['or_test',
['and_test',
['not_test',
['comparison',
['expr',
['xor_expr',
['and_expr',
['shift_expr',
['arith_expr',
['term',
['factor',
['power', ['atom_expr', ['atom', ['NAME', 'a']]]]]]]]]]]]]]],
['NAME', 'if'],
['or_test',
['and_test',
['not_test',
['comparison',
['expr',
['xor_expr',
['and_expr',
['shift_expr',
['arith_expr',
['term',
['factor',
['power', ['atom_expr', ['atom', ['NAME', 'True']]]]]]]]]]]]]]],
['NAME', 'else'],
['test',
['or_test',
['and_test',
['not_test',
['comparison',
['expr',
['xor_expr',
['and_expr',
['shift_expr',
['arith_expr',
['term',
['factor',
['power', ['atom_expr', ['atom', ['NAME', 'b']]]]]]]]]]]]]]]]]],
['NEWLINE', ''],
['ENDMARKER', '']]
As an example of how this tree lines up with the grammar, the subtree headed by test in the third line of the above output corresponds to the grammar production
test: or_test ['if' or_test 'else' test]
Here's an example of a file_input parse:
>>> pprint(symbolic(parser.suite("import sys").tolist()))
['file_input',
['stmt',
['simple_stmt',
['small_stmt',
['import_stmt',
['import_name',
['NAME', 'import'],
['dotted_as_names',
['dotted_as_name', ['dotted_name', ['NAME', 'sys']]]]]]],
['NEWLINE', '']]],
['NEWLINE', ''],
['ENDMARKER', '']]