ANTLR4 - How to match something until two characters match?

Flutter:

Framework • revision 18116933e7 (vor 8 Wochen) • 2021-10-15 10:46:35 -0700
Engine • revision d3ea636dc5
Tools • Dart 2.14.4

Antrl4:
antlr4: ^4.9.3

I would like to implement a simple tool that formats text like in the following definition: https://www.motoslave.net/sugarcube/2/docs/#markup-style

So basically each __ is the start of an underlined text and the next __ is the end.

I got some issues with the following input:

^^subscript=^^

Shell: line 1:13 token recognition error at '^'
Shell: line 1:14 extraneous input '' expecting {'==', '//', '''', '__', '~~', '^^', TEXT}

MyLexer.g4:


STRIKETHROUGH : '==';
EMPHASIS : '//';
STRONG : '\'\'';
UNDERLINE : '__';
SUPERSCRIPT : '~~';
SUBSCRIPT : '^^';

TEXT
 : ( ~[<[$=/'_^~] | '<' ~'<' | '=' ~'=' | '/' ~'/' | '\'' ~'\'' | '_' ~'_' | '~' ~'~' | '^' ~'^' )+
;

MyParser.g4:


options {
    tokenVocab=SugarCubeLexer;
    //language=Dart;
}

parse
 : block EOF
 ;

block
 : statement*
 ;

statement
 : strikethroughStyle
 | emphasisStyle
 | strongStyle
 | underlineStyle
 | superscriptStyle
 | subscriptStyle
 | unstyledStatement
 ;

unstyledStatement
 : plaintext
 ;

strikethroughStyle
 : STRIKETHROUGH (emphasisStyle | strongStyle | underlineStyle | superscriptStyle | subscriptStyle | unstyledStatement)* STRIKETHROUGH
 ;

emphasisStyle
 : EMPHASIS (strikethroughStyle | strongStyle | underlineStyle | superscriptStyle | subscriptStyle | unstyledStatement)* EMPHASIS
 ;

strongStyle
 : STRONG (strikethroughStyle | emphasisStyle | underlineStyle | superscriptStyle | subscriptStyle | unstyledStatement)* STRONG
 ;

underlineStyle
 : UNDERLINE (strikethroughStyle | emphasisStyle | strongStyle | superscriptStyle | subscriptStyle | unstyledStatement)* UNDERLINE
 ;

superscriptStyle
 : SUPERSCRIPT (strikethroughStyle | emphasisStyle | strongStyle | underlineStyle | subscriptStyle | unstyledStatement)* SUPERSCRIPT
 ;

subscriptStyle
 : SUBSCRIPT (strikethroughStyle | emphasisStyle | strongStyle | underlineStyle | superscriptStyle | unstyledStatement)* SUBSCRIPT
 ;

plaintext
 : TEXT
 ;

I would be super happy for any help. Thanks

Solution

It's you TEXT rule:

TEXT
    : (
        ~[<[$=/'_^~]
        | '<' ~'<'
        | '=' ~'='
        | '/' ~'/'
        | '\'' ~'\''
        | '_' ~'_'
        | '~' ~'~'
        | '^' ~'^'
    )+
    ;

You can't write a Lexer rule in ANTLR like you're trying to do (i.e. a '^' unless it's followed by another '^'). The ~'^' means "any character that's not ^")

if you run your input through grun with a -tokens option, you'll see that the TEXT token pulls everything through the EOL

[@0,0:1='^^',<'^^'>,1:0]
[@1,2:14='subscript=^^\n',<TEXT>,1:2]
[@2,15:14='<EOF>',<EOF>,2:0]

Try something like this:

grammar MyParser
    ;

parse: block EOF;

block: statement*;

statement
    : STRIKETHROUGH statement STRIKETHROUGH # Strikethrough
    | EMPHASIS statement EMPHASIS           # Emphasis
    | STRONG statement STRONG               # Strong
    | UNDERLINE statement UNDERLINE         # Underline
    | SUPERSCRIPT statement SUPERSCRIPT     # SuperScript
    | SUBSCRIPT statement SUBSCRIPT         # Subscript
    | plaintext                             # unstyledStatement
    ;

plaintext: TEXT+;

STRIKETHROUGH: '==';
EMPHASIS:      '//';
STRONG:        '\'\'';
UNDERLINE:     '__';
SUPERSCRIPT:   '~~';
SUBSCRIPT:     '^^';

TEXT: .;

This grammar correctly parses your input, but at the expense of turning everything other than your special characters into single character tokens.

With a bit more thought, we can minimize this:

grammar MyParser
    ;

parse: block EOF;

block: statement*;

statement
    : STRIKETHROUGH statement STRIKETHROUGH # Strikethrough
    | EMPHASIS statement EMPHASIS           # Emphasis
    | STRONG statement STRONG               # Strong
    | UNDERLINE statement UNDERLINE         # Underline
    | SUPERSCRIPT statement SUPERSCRIPT     # SuperScript
    | SUBSCRIPT statement SUBSCRIPT         # Subscript
    | (U_TEXT | TEXT)+                      # unstyledStatement
    ;

STRIKETHROUGH: '==';
EMPHASIS:      '//';
STRONG:        '\'\'';
UNDERLINE:     '__';
SUPERSCRIPT:   '~~';
SUBSCRIPT:     '^^';

U_TEXT: ~[=/'_~^]+;
TEXT:   .;

This adds the U_TEXT lexer rule. This rule will pull together all unambiguous characters into a single token. This significantly reduces the number of tokens produced. (as well as the number of diagnostic warnings). It should perform much better than the first (I've not tried/timed it on large enough input to see the difference, but the resulting parse tree is much better.

Elaboration:

The ANTLR lexer rule evaluation works by examining your input. When multiple rules could match the next n characters of input, then it will continue looking at input characters until a character fails to match any of the "active" lexer rules. This establishes the longest run of characters that could match a lexer rule. If this is a single rule, it wins (by virtue of having matched the longest sequence of input characters). If there is more than one rule matching the same run of input characters then the lexer matches the first of those rules to appear in your grammar. (Technically, these situations are "ambiguities", as, looking at the whole grammar, there are multiple ways that ANTLR could have tokenized it. But, since ANTLR has deterministic rules for resolving these ambiguities, they're not really a problem.)

Lexer rules, just don't have the ability to use negation except for negating a set of characters (that appear between [ and ]). That means we can't write a rule to match a "< not followed by another <". We can match "<<" as a longer token than "<". To do that, we have to ensure that all tokens that could start one of your two character sequences, match a single token rule. However, we want to avoid making ALL other characters single character rules, so we can introduce a rules that is "everything but on our our special characters". This will greedily consume everything that isn't possibly "problematic". Leaving only the special characters to be caught by the single character `'.'`` rule at the end of the grammar.