parsing, compiler-construction, language-agnostic

What differentiates syntax analysis and semantic analysis?


To my understanding, a parser is composed of three stages: lexical, syntactic, and semantic analysis.

  1. Lexical: It will split my input into tokens. Example: 123+100-0 -> 123, +, 100, -, 0

  2. Syntactic: It will look into the tokens and check if they make sense with each other.

The problem I am having is understanding the final phase, "semantic analysis," and how it differs from the second stage, "syntactic analysis." To my understanding, the final stage also validates the tokens that have been validated by syntactic analysis and then prints out the output.


Solution

  • Generally speaking, the goal of syntax analysis is to turn a stream of tokens (generated by lexical analysis) into some sort of structured representation of what the program "means." For example, in syntax analysis, you'd expect the tokens a, /, and b to be converted into something that says "the quotient of a and b." Typically, the output of syntax analysis is an abstract syntax tree, which is a tree structure encoding the structure of the program.
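
    To make this concrete, here is a minimal sketch of that step in Python. The node classes and the `parse_binary` helper are illustrative, not from any real compiler; a real parser would handle arbitrary expressions, but the output shape is the point:

    ```python
    from dataclasses import dataclass

    @dataclass
    class Var:
        """A variable reference, e.g. the token 'a'."""
        name: str

    @dataclass
    class BinOp:
        """A binary operation node: 'the quotient of left and right' when op is '/'."""
        op: str
        left: object
        right: object

    def parse_binary(tokens):
        """Parse a trivial three-token expression like ['a', '/', 'b'] into an AST."""
        left, op, right = tokens
        return BinOp(op, Var(left), Var(right))

    tree = parse_binary(["a", "/", "b"])
    print(tree)   # BinOp(op='/', left=Var(name='a'), right=Var(name='b'))
    ```

    Note that the parser happily builds this node without knowing anything about what `a` and `b` are; that question belongs to the next phase.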

    The job of semantic analysis is to make sure that the representation generated by the syntax analysis step actually "makes sense." For example, suppose that the syntax analyzer has converted a, /, and b into "the quotient of a and b." At the time the syntax analyzer did this, it probably had no conception of what the types of a and b were; more likely, the syntax analyzer simply saw that this could be interpreted as a mathematical expression and left things at that. However, it's entirely possible that this is not a well-defined operation. For example, if a is a string and b is a boolean, then a / b is a nonsense operation and the compiler should report an error.
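
    A semantic check for that case might look like the sketch below. The AST shape `(op, left, right)` and the type names are illustrative assumptions; the idea is just to walk the tree with a symbol table mapping variable names to types:

    ```python
    def check(node, symbol_table):
        """Return the type of an expression, or raise on a nonsense operation."""
        if isinstance(node, str):                       # variable reference
            return symbol_table[node]
        op, left, right = node                          # binary operation node
        lt = check(left, symbol_table)
        rt = check(right, symbol_table)
        if op == "/" and (lt, rt) != ("number", "number"):
            raise TypeError(f"cannot divide {lt} by {rt}")
        return "number"

    # Syntactically fine, semantically nonsense: a is a string, b is a boolean.
    try:
        check(("/", "a", "b"), {"a": "string", "b": "boolean"})
    except TypeError as e:
        print("semantic error:", e)   # semantic error: cannot divide string by boolean
    ```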

    For a more elaborate example, suppose you're implementing a Java compiler and you see the tokens public, class, A, extends, B, {, }. You could interpret this to be a class definition during syntax analysis, since it's a syntactically-valid class. But what if B, which is defined later in the file, isn't a class? Or what if B isn't defined at all? In that case, the code is incorrect. It's syntactically valid in the sense that it conceivably could be legitimate code, but it's semantically incorrect in that there isn't a class B to be found. The syntax analyzer would therefore produce a totally valid AST, while the semantic analyzer would then reject the code after it found that B didn't exist.
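
    That "is there actually a class B?" check can be sketched as a pass over collected declarations. The data layout here is an assumption for illustration (a dict mapping each declared class to its superclass name, or `None` if it has none):

    ```python
    def check_superclasses(class_decls):
        """Reject any class whose declared superclass was never declared."""
        declared = set(class_decls)
        for name, superclass in class_decls.items():
            if superclass is not None and superclass not in declared:
                raise NameError(f"class {name} extends unknown class {superclass}")

    # 'public class A extends B { }' parsed into a valid AST, but B is never declared:
    try:
        check_superclasses({"A": "B"})
    except NameError as e:
        print(e)   # class A extends unknown class B
    ```

    Because `B` may legitimately be defined later in the file, the semantic analyzer typically collects all declarations first and only then runs this kind of check, which is exactly why it works best as a separate pass over a completed AST.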

    There's no fundamental reason why we have to keep these two steps separated, and in fact some compilers meld the two together. The main reason for keeping them separate is practical - we have some great tools for building syntax analyzers (LR parsers, LL parsers, etc.), and they work by interpreting sequences of tokens without thinking too much about what they mean. The compiler author can specify the general syntax of the language, then write a semantic analyzer as a second pass that looks over the AST and determines whether it's valid. This keeps a clean separation between what the parser generators can do automatically and what the programmer needs to write code for by hand.

    Hope this helps!