Search code examples
parsingabstract-syntax-treesemantic-analysis

Is it common to have two semantic analysis phases within compiler construction?


I've been studying the grammar and AST nodes for various languages with AST explorer. With python, I've noticed that some form of semantic analysis is taking place during the parsing process. E.g.

x = 2
x = 2

Yields the following AST consisting of a VariableDeclaration node and ExpressionStatement node.

Python AST

So when the first x = 2 line is parsed, it checks a symbol table for the existence of x and then registers it and produces a VariableDeclaration node. Then when the second x = 2 line is parsed, it finds that x is already defined and produces a ExpressionStatement node.

However when I attempt to use the following semantically incorrect code:

2 + "string"

It accepts the code, and produces a ExpressionStatement node - even though it is semantically incorrect i.e. int + string, and rightly so produces an error when I attempt to execute it with a python interpreter.

This suggests to me that semantic analysis takes place twice: once during the parsing process and once again while traversing the complete AST. Is this assumption correct? If so why is this the case? Wouldn't it be more simple to do the entire semantic analysis phase during parsing instead of splitting it up?


Solution

  • The semantic error in the statement 2 + "string" is not detected in any semantic pass whatsoever. It is a runtime error and it is reported when you attempt to execute the statement. If the statement is never executed, no error is reported, as you can see by executing the script

        if False:
            2 + "string"
        print("All good!")
    

    Resolving the first use of a global variable as a declaration is more of an optimisation than anything else, and it is common for compilers to execute multiple optimisation passes.

    There is always a temptation to try to combine these multiple passes but it is a false economy: walking an AST is relatively low overhead, and code is much clearer and more maintainable when it only attempts to do one thing. Intertwining two unrelated optimization heuristics is poor design, just as is intertwining any set of unrelated procedures.