I was trying to optimize a compiler written with JavaCC when I came across something I had never seen in compilers before, since I was taught to use tokens for every terminal.
This compiler sometimes uses string literals instead of tokens in its syntactic-analysis productions, for example:
<TK_IF> "(" log_expr ")" body
instead of:
<TK_IF> <TK_LPAREN> log_expr <TK_RPAREN> body
This is just an example; other parts of the code use strings for operators such as +, -, !=, ==, > and <.
What I want to know is whether there is any difference between using tokens and using strings in the compiler, mainly in terms of performance, since my goal is to optimize it.
The answer is in the FAQ.
Say I have two definitions (lexical productions, in JavaCC terminology):
TOKEN : { <ID : (["a"-"z","A"-"Z"])+ >
| <BECOMES : ":=" >}
This defines a token kind named ID. It corresponds to infinitely many strings that might appear in the input file: apple, pear, fruitBasket. It also defines a token kind BECOMES that can only appear as the string :=.
In your BNF productions, you need to refer to token kinds. So a BNF production might be
void assignment() : {} { <ID> <BECOMES> expression() }
But, as explained in the FAQ, since the BECOMES token kind can only refer to the string := and is presumably the only such token kind, JavaCC lets you write this BNF production as
void assignment() : {} { <ID> ":=" expression() }
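The two ways of writing the production are identical: both refer to the same token kind, so choosing one over the other should make no difference to performance.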
In your case, the strings "(" and ")" in nonterminal productions simply abbreviate <TK_LPAREN> and <TK_RPAREN>, respectively.
This abbreviation can only be used in BNF productions.
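To make this concrete for your grammar, here is a minimal sketch of a complete .jj file. Your actual grammar isn't shown, so several things here are my assumptions: the definitions of TK_IF, TK_LPAREN and TK_RPAREN as the strings "if", "(" and ")", the parser name Demo, the production names if_with_tokens and if_with_strings, and the stub bodies of log_expr and body.

```
options { STATIC = false; }

PARSER_BEGIN(Demo)
public class Demo {}
PARSER_END(Demo)

// Whitespace is skipped between tokens.
SKIP : { " " | "\t" | "\n" | "\r" }

// Assumed token definitions; each string literal is defined exactly once.
// TK_IF comes before ID so that "if" is matched as TK_IF rather than ID.
TOKEN : {
  <TK_IF : "if">
| <TK_LPAREN : "(">
| <TK_RPAREN : ")">
| <ID : (["a"-"z","A"-"Z"])+ >
}

// Form 1: every terminal is written as a token name.
void if_with_tokens() : {} {
  <TK_IF> <TK_LPAREN> log_expr() <TK_RPAREN> body()
}

// Form 2: "(" and ")" stand for the token kinds defined above.
void if_with_strings() : {} {
  <TK_IF> "(" log_expr() ")" body()
}

// Stub productions, only here so the sketch is complete.
void log_expr() : {} { <ID> }
void body() : {} { <ID> }
```

Under these assumptions, if_with_tokens and if_with_strings accept exactly the same input, because "(" and ")" in the second one refer to TK_LPAREN and TK_RPAREN.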