The syntax I'm trying to parse includes a continuation indicator in column 72 (character position 71, counting from zero). Identifiers, literals, and almost anything else can be continued onto the next line.
Ideally, I would like to drop the characters that make up the continuation token, so that I'm left with only the identifier characters. However, with the following lexer rules, the 'setText("")' in LINE_CONTINUATION is ignored, so the continuation text pollutes the final IDENTIFIER token.
IDENTIFIER
:
{getCharPositionInLine() < 71 }? IDENTIFIER_PART
(
{getCharPositionInLine() < 71 }? IDENTIFIER_PART
| LINE_CONTINUATION
)*
;
fragment IDENTIFIER_PART: (LETTER|DIGIT|'_');
fragment DIGIT: [0-9];
fragment LETTER options { caseInsensitive=true; } : [A-Z];
//A continuation line is non-blank in column 72, followed by anything until EOL,
//then on next line the characters starting after column position 15
LINE_CONTINUATION
:
{getCharPositionInLine() == 71 }?
~[ ]
~[\r\n]* EOL
({getCharPositionInLine() <= 15 }? [ ] )+
{setText("");}
;
Is there any way of overriding the value of a subrule (or fragment) in the same way that root rules can be overridden?
For example, there could be a list of identifiers defined as:
AAAAAAAAAAAA,BBBBBBBBBBB,CCCCCCCCCCCCCCCCC,DDDDDDDDDDD,EEEEEEEEEE,FFFF* Some comment
FFFF,GGGGGGGG
I'm trying to get tokens with text:
AAAAAAAAAAAA
BBBBBBBBBBB
CCCCCCCCCCCCCCCCC
DDDDDDDDDDD
EEEEEEEEEE
FFFFFFFF
GGGGGGGG
However I get:
AAAAAAAAAAAA
BBBBBBBBBBB
CCCCCCCCCCCCCCCCC
DDDDDDDDDDD
EEEEEEEEEE
FFFF* Some comment\nFFFF
GGGGGGGG
That is not possible. You will have to do the setText(…) inside your IDENTIFIER rule. Try something like this (untested):
IDENTIFIER
: {getCharPositionInLine() < 71 }? IDENTIFIER_PART
( {getCharPositionInLine() < 71 }? IDENTIFIER_PART
| LINE_CONTINUATION
)*
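{
String text = getText();
// Remove each continuation: the indicator, the rest of that line, the line break, and up to 15 leading blanks.
// Caution: the leading non-blank match is not anchored to column 72, so it can start at the first character of the identifier and strip identifier text too.
setText(text.replaceAll("\\S[^\r\n]*[\r\n]+[ ]{0,15}", ""));
}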
;
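Since a regex on the token text alone cannot tell where column 72 is, a column-aware strip may be more robust. Below is a minimal sketch in plain Java, not a tested drop-in: the class name ContinuationStripper and its strip method are hypothetical, and the constants 71 and 15 simply mirror the predicates in the question's grammar. It walks the raw token text while tracking the zero-based column, and whenever it reaches position 71 it drops the indicator, the rest of that line, one line break, and the leading blanks of the next line.

```java
// Hypothetical helper (not part of ANTLR): removes line continuations from raw token text.
final class ContinuationStripper {
    // Zero-based position of the continuation indicator (column 72), as in the question.
    static final int CONTINUATION_COLUMN = 71;
    // Leading blanks on the continued line are skipped while the position is <= 15,
    // mirroring the predicate in the question's LINE_CONTINUATION rule.
    static final int MAX_INDENT_COLUMN = 15;

    // raw: the text matched by the token; startColumn: zero-based position of its first character.
    static String strip(String raw, int startColumn) {
        StringBuilder out = new StringBuilder();
        int col = startColumn;
        int i = 0;
        while (i < raw.length()) {
            if (col == CONTINUATION_COLUMN) {
                // Drop the indicator and everything up to the end of the line.
                while (i < raw.length() && raw.charAt(i) != '\r' && raw.charAt(i) != '\n') {
                    i++;
                }
                // Drop one line break (\r, \n, or \r\n).
                if (i < raw.length() && raw.charAt(i) == '\r') i++;
                if (i < raw.length() && raw.charAt(i) == '\n') i++;
                // Drop the leading blanks of the continued line.
                col = 0;
                while (i < raw.length() && raw.charAt(i) == ' ' && col <= MAX_INDENT_COLUMN) {
                    i++;
                    col++;
                }
            } else {
                // Ordinary token character: keep it and advance the column.
                out.append(raw.charAt(i));
                i++;
                col++;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // "FFFF" starts at position 67, so the '*' lands on position 71.
        System.out.println(strip("FFFF* Some comment\n FFFF", 67)); // prints FFFFFFFF
    }
}
```

From the IDENTIFIER action this could be called as setText(ContinuationStripper.strip(getText(), startColumn)), where startColumn is the position at which the token began. I believe the ANTLR 4 Java runtime exposes that as the lexer's _tokenStartCharPositionInLine field, but treat that as an assumption to check against your runtime version.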