Search code examples
parsingcompiler-constructionlexical-analysis

Separator, Comparison, Operator Token Types


From "Compilers Principles, Techniques, & Tools, 2nd Ed." ("The Purple Book") by Aho, Lam, Sethi and Ullman:

Figure 3.2: Examples of tokens pg. 112

[Token]       [Informal Description]                  [Sample Lexemes]
if            characters i, f                         if
else          characters e, l, s, e                   else
comparison    < or > or <= or >= or == or !=          <=, !=
id            letter followed by letters and digits   pi, score, D2
number        any numeric constant                    3.14159, 0, 6.02e23
literal       anything but ", surrounded by "'s       "core dumped"

In the above, they separate if and else into their own token types. In most examples I've seen, these would be a single keyword token type and the values of the tokens would be if or else. What is the benefit of having separate token types for each keyword rather than a keyword token type?

What is the benefit of having token types like comparison? Why not have a token type for each kind of comparison like the following?

[Token]       [Informal Description]                  [Sample Lexemes]
lt            <                                       <
gt            >                                       >
lte           <=                                      <=
gte           >=                                      >=
eq            ==                                      ==
neq           !=                                      !=

Solution

  • Opinions vary about how individual operators are represented when the operators are syntactically identical. Many people will write separate productions for different operators even though there is no real syntactic difference and the semantic difference is limited.

    Having said that, there are languages in which ==, >= and <= are syntactically distinct. In C (and its family), the precedence of these operators differs making it possible to write a <= b == b <= c without parentheses, although code containing that expression is unlikely to survive a code review. (Even with parentheses, the expression is questionable.) In Perl, a <= b <= c is a valid cascading comparison, but a <= b == c is not. Etc.

    The general rule is that if a token has a distinct role in the language syntax, the difference must be visible to the parser, and the parser only takes into account the token's type, not its value. For that reason, if, then and else must be different token types in any practical grammar.