compiler-construction compiler-optimization lexical-analysis javacc

Is there any difference between using tokens and strings in compiler construction?


I was trying to optimize a compiler written with JavaCC when I came across something I had never seen in my previous work with compilers, because I was taught to use tokens for every terminal.

This compiler sometimes uses strings instead of tokens in the productions of the syntactic analysis, for example:

<TK_IF> "(" log_expr ")" body

instead of:

<TK_IF> <TK_LPAREN> log_expr <TK_RPAREN> body

This is just an example; other parts of the code use strings for operators such as +, -, !=, ==, > and <.

What I want to know is whether there is any difference between using tokens and strings in the compiler, mainly in terms of performance, since my goal is to optimize it.


Solution

  • The answer is in the FAQ.

    Say I have two definitions (or lexical productions in JavaCC terminology)

    TOKEN : { <ID : (["a"-"z","A"-"Z"])+ >
          |   <BECOMES : ":=" > }
    

    This defines a token kind named ID. It corresponds to infinitely many strings that might appear in the input file: apple, pear, fruitBasket. It also defines a token kind BECOMES that can only appear as the string :=.

    In your BNF productions, you need to refer to token kinds. So a BNF production might be

    void assignment() : {} { <ID> <BECOMES> expression() }
    

    But, as explained in the FAQ, since the BECOMES token kind can only match the string := (and is presumably the only token kind defined by that string), JavaCC lets you write this BNF production as

    void assignment() : {} { <ID> ":=" expression() }
    

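    The two ways of writing the production are equivalent: both refer to the same token kind, so the choice should not affect the performance of the generated parser.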

    In your case, the strings "(" and ")" in nonterminal productions simply abbreviate <TK_LPAREN> and <TK_RPAREN>, respectively.

    This abbreviation can only be used in BNF productions.
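
    Applied to the if statement from the question, this might look like the sketch below. The question does not show its lexical productions, so the token definitions here are assumptions, as is the production name if_stmt; log_expr() and body() stand for the question's nonterminals.

    // Assumed lexical productions (not shown in the question)
    TOKEN : { <TK_IF : "if" >
          |   <TK_LPAREN : "(" >
          |   <TK_RPAREN : ")" > }

    // Written with string literals, as in the compiler being optimized
    void if_stmt() : {} { <TK_IF> "(" log_expr() ")" body() }

    // The same production written with token names; use one form or the other
    // void if_stmt() : {} { <TK_IF> <TK_LPAREN> log_expr() <TK_RPAREN> body() }

    Under these assumptions, "(" and ")" resolve to TK_LPAREN and TK_RPAREN, so choosing between the two forms is a matter of readability.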