Tags: regex, syntax, antlr, antlr4, grammar

How to capture a literal in antlr4?


I am looking to make a rule for a regex character class that is of the form:

 character_range
   : '[' literal '-' literal ']'
   ;

For example, with [1-5]+ I could match the string "1234543" but not "129". However, I'm having a hard time figuring out how to define a "literal" in ANTLR4. Normally I would use [a-zA-Z], but that covers only ASCII and excludes characters such as é. How should I do that?


Solution

  • Actually, you don't want to match an entire literal, because a literal can be more than one character. Each bound of the range is a single character, so a single-character token is all you need.

    In the parser:

    character_range: OPEN_BRACKET LETTER DASH LETTER CLOSE_BRACKET;
    

    And in the lexer:

    OPEN_BRACKET: '[';
    CLOSE_BRACKET: ']';
    DASH: '-';
    // Any single Unicode letter (general category L)
    LETTER: [\p{L}];
    

    The character class used in the LETTER lexer rule matches Unicode letters (general category L), as described in ANTLR's Unicode documentation. Other property values are listed in UAX #44, the Unicode Character Database annex. LETTER alone does not cover digits such as those in [1-5], so you may need further categories such as Numbers (\p{N}), Punctuation (\p{P}) or Separators (\p{Z}) to cover all possible regex character classes.
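
    Putting the pieces together, a combined grammar might look like the sketch below. It assumes ANTLR 4.7 or later (for Unicode property escapes in lexer character sets). The grammar name CharRange and the token name CHAR are illustrative choices, not part of the answer above. CHAR widens LETTER with \p{N} so that the [1-5] example from the question tokenizes as well.

    ```antlr
    // Save as CharRange.g4: ANTLR requires the file name to match the grammar name.
    grammar CharRange;

    // Exactly one character on each side of the dash.
    character_range
        : OPEN_BRACKET CHAR DASH CHAR CLOSE_BRACKET
        ;

    OPEN_BRACKET  : '[';
    CLOSE_BRACKET : ']';
    DASH          : '-';

    // Assumption: letters plus numbers are enough for the ranges of interest.
    // Add \p{P}, \p{Z}, etc. if punctuation or separators must be allowed too.
    CHAR : [\p{L}\p{N}];
    ```

    With this sketch, both [a-é] and [1-5] should produce the token sequence OPEN_BRACKET CHAR DASH CHAR CLOSE_BRACKET. The grammar does not check that the first bound sorts before the second; that would have to be done separately, for example in a listener or visitor.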