From "Compilers Principles, Techniques, & Tools, 2nd Ed." ("The Purple Book") by Aho, Lam, Sethi and Ullman:
Figure 3.2: Examples of tokens pg. 112
[Token] [Informal Description] [Sample Lexemes]
if characters i, f if
else characters e, l, s, e else
comparison < or > or <= or >= or == or != <=, !=
id letter followed by letters and digits pi, score, D2
number any numeric constant 3.14159, 0, 6.02e23
literal anything but ", surrounded by "'s "core dumped"
In the above, they separate if
and else
into their own token types. In most examples I've seen, these would be a single keyword
token type and the values of the tokens would be if
or else
. What is the benefit of having separate token types for each keyword rather than a keyword
token type?
What is the benefit of having token types like comparison
? Why not have a token type for each kind of comparison like the following?
[Token] [Informal Description] [Sample Lexemes]
lt < <
gt > >
lte <= <=
gte >= >=
eq == ==
neq != !=
Opinions vary about how individual operators are represented when the operators are syntactically identical. Many people will write separate productions for different operators even though there is no real syntactic difference and the semantic difference is limited.
Having said that, there are languages in which ==
, >=
and <=
are syntactically distinct. In C (and its family), the precedence of these operators differs making it possible to write a <= b == b <= c
without parentheses, although code containing that expression is unlikely to survive a code review. (Even with parentheses, the expression is questionable.) In Perl, a <= b <= c
is a valid cascading comparison, but a <= b == c
is not. Etc.
The general rule is that if a token has a distinct role in the language syntax, the difference must be visible to the parser, and the parser only takes into account the token's type, not its value. For that reason, if
, then
and else
must be different token types in any practical grammar.