Search code examples
pythongrammarabstract-syntax-treebnfcode-documentation

How many BNF-like grammar specifications does Python have?


There are at least 3 grammar-like specifications within the Python 3.9.1 reference documents 'library-3.9.1.pdf' and 'reference-3.9.1.pdf' (aka https://docs.python.org/3.9/library/ast.html and https://docs.python.org/3.9/reference/grammar.html).

There is the "abstract grammar" library-3.9.1.pdf, page 1895 that looks like this:

-- ASDL's 4 builtin types are:
-- identifier, int, string, constant

module Python
{
    mod = Module(stmt* body, type_ignore* type_ignores)
        | Interactive(stmt* body)
        | Expression(expr body)
        | FunctionType(expr* argtypes, expr returns)

    stmt = FunctionDef(identifier name, arguments args,
                       stmt* body, expr* decorator_list, expr? returns,
                       string? type_comment)
          | AsyncFunctionDef(identifier name, arguments args,
                             stmt* body, expr* decorator_list, expr? returns,
                             string? type_comment)
...

There is the "PEG grammar for Python"/"full grammar specification" ("a mixture of EBNF and PEG") (reference-3.9.1.pdf, Chapter 10, pages 111-120 that looks like this:

# type_expressions allow */** but ignore them
type_expressions:
    | ','.expression+ ',' '*' expression ',' '**' expression 
    | ','.expression+ ',' '*' expression 
    | ','.expression+ ',' '**' expression 
    | '*' expression ',' '**' expression 
    | '*' expression 
    | '**' expression 
    | ','.expression+ 

statements: statement+ 
statement: compound_stmt  | simple_stmts
...

And there are various snippets of "BNF notation" that are "used to describe syntax, not lexical analysis" sprinkled about eg., reference-3.9.1.pdf, pages 65-108 that look like this:

comprehension ::=  assignment_expression comp_for
comp_for      ::=  ["async"] "for" target_list "in" or_test [comp_iter]
comp_iter     ::=  comp_for | comp_if
comp_if       ::=  "if" or_test [comp_iter]

This last bit of "BNF notation" seems the most useful for my purposes, and looks very similar to the "full grammar specification", but it defines all sorts of syntactic types (like e.g. "assignment_expression" and "comp_iter" (reference-3.9.1.pdf, page 67)) which aren't mentioned in either the "full grammar specification" or the "abstract grammar".

What is the relationship between all of these, and, more importantly, is the "BNF notation" that is "used to describe syntax" an actual specification of some sort, or just being used for exposition within the reference document? And, if it is an actual specification, where might it be found, i.e. as a single file that can be conveniently parsed?

Regards!


Solution

  • The grammar used to build Python isn't on your list at all. You can find it in the source bundle as Grammar/python.gram. Note that it has five different start symbols for different compilation contexts (complete files, interactive input, single expressions, etc.), although most of the grammar is shared. This grammar is actually used by the parser generator to produce the parser actually used by the CPython implementation, so it describes perfectly the inputs accepted by the concrete Python interpreter (of course, subject to change in future versions). As noted in the comments near the top of the file, the syntax of the grammar file itself are defined in a PEP.

    But it's not necessarily useful for documentary purposes, and it is not definitive in the sense that it defines the language, in the sense that one might expect from a language standard. Python is not really standardized, so it might be unreasonable to ask for a definitive reference grammar. But the grammar scattered through the Python reference manual is probably as close as you're going to get. Those productions are for expository purposes, though, and they need to be considered along with the accompanying narrative text. (That's generally true of reference grammars, even for standardized grammars, because a context-free grammar cannot capture all the syntactic aspects of any real-life programming language, possibly with a couple of exceptions. So it's usual that the reference grammar for a language will accept a superset of the set of valid programs, and the narrative in the standard provides additional constraints which cannot be expressed in a CFG.)

    The collected grammar at the end of the reference manual is supposedly a summary of the snippets, but as I understand it, it's generated mechanically from the python.gram file. So it's possible that there are divergences between it and the language described in the manual.

    The abstract grammar listed in the ast module documentation actually defines the structure of the abstract syntax tree, so it's not a grammar at all in the usual sense (i.e. a description of a linear sequence of symbols) but rather a collection of structure definitions, each one describing the nature of a typed node in the abstract syntax tree. It corresponds directly to the objects you'll find in an AST built with the ast module, but it doesn't attempt to constrain the syntax of a program being parsed. The text in the library reference is the contents of an actual source file, Parser/Python.asdl, which is mechanically processed into declarations and code used by the CPython parser to produce an AST.