
ANTLR - String Recognition Error


I have an ANTLR grammar file with the following string definition:

STRING
  : '"' ( EscapeSequence | ~('\\' | '"') )* '"'
  ;

fragment EscapeSequence
  : '\\' .
  ;

However, this lexer rule ignores the escape character before the first quote. In the input

id\=\"

the quote is recognized as the start of a string even though an escape character precedes it. This happens only for the first quote. All subsequent quotes, if escaped, are recognized properly.

/id\=\"Testing\" -- Should not be a string as both quotes are escaped
/id\="Testing" -- Should be a string between the quotes, since they are not escaped

The main problem is to prevent the lexer from starting a string when the character immediately preceding the quote is an escape character. If several escape characters precede the quote, only the single character directly before the starting quote needs to be considered.


Solution

  • ANTLR will automatically provide the behavior you desire in almost every situation. Consider the following input:

    /id\=\"Testing\"
    

    The critical requirement involves the location and length of the token preceding the first quote character. In the following block I add spaces only for illustrating conditions that occur between characters.

    / i d \ = \ " T e s t i n g \ "
               ^
               |
               ----------- Make sure no token can *end* here
    

    The lexer only begins matching a new token at the position where the previous token ended. By ensuring that the first " character is part of the same token as the \ character before it, you ensure that the first " character will never be interpreted as the start of a STRING token.

    If the above condition is not met, your " character will be treated as the start of a STRING token.
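
    One way to meet this condition, offered as a minimal sketch rather than as part of the original grammar, is to add a lexer rule that consumes a backslash together with the character that follows it. The grammar name and the rule names ESCAPED and OTHER are hypothetical and chosen only for illustration. A real grammar would likely replace OTHER with its own token rules.

    ```antlr
    lexer grammar EscapedQuoteSketch;   // hypothetical grammar name

    // Unchanged from the question: a quoted string with backslash escapes.
    STRING
      : '"' ( EscapeSequence | ~('\\' | '"') )* '"'
      ;

    // Assumption: outside a string, a backslash and the character after it
    // form one token, so no token can end between '\' and '"'.
    ESCAPED
      : EscapeSequence
      ;

    // Assumption: catch-all for any other single character in this sketch.
    OTHER
      : ~('\\' | '"')
      ;

    fragment EscapeSequence
      : '\\' .
      ;
    ```

    Under these assumptions, in /id\=\"Testing\" each of \= and \" is matched as a single ESCAPED token, so the lexer never begins a token at an escaped quote. In /id\="Testing" the \= is consumed as ESCAPED, the next token begins at the unescaped ", and "Testing" is matched as a STRING.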