Let's say I have a simple syntax where you can assign a number to an identifier using an = sign. I can write the parser in two ways: I can include the character token '=' directly in the rule, or I can create a named token for it and have the lexer recognize it.
Option #1:
// lexer
[A-Za-z_][A-Za-z_0-9]* { return IDENTIFIER; }
[0-9]+ { return NUMBER; }
= { return '='; /* the character is its own token code */ }
// parser
%token IDENTIFIER NUMBER
%%
assignment : IDENTIFIER '=' NUMBER ;
Option #2:
// lexer
[A-Za-z_][A-Za-z_0-9]* { return IDENTIFIER; }
[0-9]+ { return NUMBER; }
= { return EQUAL_SIGN; }
// parser
%token IDENTIFIER NUMBER EQUAL_SIGN
%%
assignment : IDENTIFIER EQUAL_SIGN NUMBER ;
Both ways of writing the parser work, and I can't quite find any information about good practices for this situation. The first snippet seems more readable, but that is not my highest concern.
Is either of the two options faster or otherwise beneficial? Are there technical reasons (other than readability) to prefer one over the other?
Is there maybe a third, better way?
I'm asking about problems in huge parsers, where optimization may be a real issue, not just toy examples like the one shown here.
Aside from readability, it basically makes no difference. There really is no optimisation issue, no matter how big your grammar is. Once the grammar has been compiled, tokens are small integers, and one small integer is pretty well the same as any other small integer.
But I wouldn't underrate the importance of readability. For one thing, it's harder to see bugs in unreadable code. It's surprisingly common for a grammar bug to be the result of simply typing the wrong name for some punctuation character. It's much easier to see that '{' expr '{' is wrong than if it were written T_LBRC expr T_LBRC, and furthermore the named symbols are much harder to interpret for someone whose first language isn't English.
Bison's parse table compression requires token numbers to be consecutive integers, which it achieves by passing incoming token codes through a lookup table; that's hardly a major overhead. Avoiding character codes doesn't let you skip this lookup, though, because the token numbers 1 through 255 are reserved anyway.
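As a rough illustration (not Bison's actual generated code), here's a hand-written miniature of that recoding. Real generated parsers use a flat array called yytranslate; the internal symbol numbers below are invented for the sketch:
/* Hypothetical miniature of the token recoding a generated parser performs:
   external token codes (character codes, plus %token codes from 258 up)
   are mapped to small consecutive internal symbol numbers. */
enum { IDENTIFIER = 258, NUMBER = 259 }; /* typical Bison numbering */
static int translate(int token)
{
    switch (token) { /* the real thing is a single flat array lookup */
    case '=':        return 3; /* invented internal symbol numbers */
    case IDENTIFIER: return 4;
    case NUMBER:     return 5;
    default:         return 1; /* the "undefined token" symbol */
    }
}
/* In the parse loop: int symbol = translate(yylex()); one lookup per token. */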
However, Bison's C++ API with named token constructors requires every token to have a name, and single-character token codes can't be used that way (they aren't forbidden outright, since you're not required to use the named constructors).
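For concreteness, a minimal sketch of that style, assuming the C++ skeleton with named constructors enabled (your value types will differ):
%language "c++"
%define api.token.constructor
%define api.value.type variant
%token <std::string> IDENTIFIER
%token <int> NUMBER
%token EQUAL_SIGN
// The lexer then returns complete symbols, e.g.:
//   return yy::parser::make_EQUAL_SIGN ();
// There is no make_'=', which is why every token needs a name here.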
Given that use case, Bison recently introduced an option (%define api.token.raw) which generates consecutively numbered token codes in order to avoid the recoding; it is not compatible with single-character tokens being represented as themselves. It's possible that not having to recode tokens is a marginal speed-up, though it's hard to believe it's significant; still, if you're not going to use single-quoted tokens, you might as well go for it.
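A minimal sketch of Option #2 with that option enabled (assuming a recent Bison; with it, character literals like '=' can't appear in the grammar at all):
%define api.token.raw
%token IDENTIFIER NUMBER EQUAL_SIGN
%%
assignment : IDENTIFIER EQUAL_SIGN NUMBER ;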
Personally, I don't think the extra complexity is justified, at least for the C API. If you do choose to go with token names, perhaps in order to use the C++ API's named constructors, I'd strongly recommend using Bison's token aliases so that you can still write your grammar with double-quoted tokens (which I'd also recommend for multi-character operator and keyword tokens).
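With aliases, Option #2's parser could be written like this; the rules read like Option #1, but the lexer still returns the named token:
// parser
%token IDENTIFIER NUMBER
%token EQUAL_SIGN "="
%%
assignment : IDENTIFIER "=" NUMBER ;
Bison treats "=" and EQUAL_SIGN as the same token, so the Option #2 lexer works unchanged.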