antlr4

Prevent two equal lexer tokens following each other


I have the following grammar:

lexer grammar TestLexer;

Number
    : '-'? [0-9]+
    ;

Punctuation
    : [\-.]
    ;

Identifier
    : '.'? [a-zA-Z]+
    ;

Whitespace
    : [ \t]+
      -> skip
    ;

Newline
    : ( '\r' '\n'?
      | '\n'
      )
      -> skip
    ;

and the following input file:

1-2
1 -2

.foo
foo.bar

This produces the following tokens:

[@0,0:0='1',<Number>,1:0]
[@1,1:2='-2',<Number>,1:1]
[@2,5:5='1',<Number>,2:0]
[@3,7:8='-2',<Number>,2:2]
[@4,13:16='.foo',<Identifier>,4:0]
[@5,19:21='foo',<Identifier>,5:0]
[@6,22:25='.bar',<Identifier>,5:3]
[@7,28:27='<EOF>',<EOF>,6:0]

What do I need to change so that 1-2 is recognized as Number, Punctuation, Number and foo.bar as Identifier, Punctuation, Identifier?


Solution

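  • I solved this problem with semantic predicates. Each predicate inspects the character immediately before the token start via _input.LA(-1) and rejects the rule when that character could belong to a preceding token of the same kind: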

    lexer grammar X86AsmLexer;
    
    Number
        : { _input.LA(-1) < '0' || _input.LA(-1) > '9' }? '-'? [0-9]+
        ;
    
    Punctuation
        : [\-.]
        ;
    
    Identifier
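        // The previous character must not be an ASCII letter of either case,
        // because Identifier itself matches [a-zA-Z].
        : { _input.LA(-1) < 'A' || (_input.LA(-1) > 'Z' && _input.LA(-1) < 'a') || _input.LA(-1) > 'z' }?
          '.'? [a-zA-Z]+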
        ;
    
    Whitespace
        : [ \t]+
          -> skip
        ;
    
    Newline
        : ( '\r' '\n'?
          | '\n'
          )
          -> skip
        ;
    
    LineComment
        : ';' ~[\r\n]*
        ;
    

    Now the test file (extended with a 1- 2 line)

    1-2
    1 -2
    1- 2
    
    .foo
    foo.bar
    

    is lexed as expected:

    [@0,0:0='1',<Number>,1:0]
    [@1,1:1='-',<Punctuation>,1:1]
    [@2,2:2='2',<Number>,1:2]
    [@3,5:5='1',<Number>,2:0]
    [@4,7:8='-2',<Number>,2:2]
    [@5,11:11='1',<Number>,3:0]
    [@6,12:12='-',<Punctuation>,3:1]
    [@7,14:14='2',<Number>,3:3]
    [@8,19:22='.foo',<Identifier>,5:0]
    [@9,25:27='foo',<Identifier>,6:0]
    [@10,28:28='.',<Punctuation>,6:3]
    [@11,29:31='bar',<Identifier>,6:4]
    [@12,34:33='<EOF>',<EOF>,7:0]
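To make the look-behind rule easier to experiment with outside ANTLR, here is a minimal standalone Java sketch. It is not the generated lexer and does not use the ANTLR runtime: the class name LookbehindLexerSketch and the "Rule:text" output format are made up for illustration, and regex negative lookbehind stands in for the _input.LA(-1) predicates. It assumes the Identifier check should reject any preceding ASCII letter, upper or lower case. Like an ANTLR lexer, it picks the longest match and, on a tie, the rule listed first.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LookbehindLexerSketch {
    // Rule names, in the same order as the grammar (hypothetical output labels).
    private static final String[] NAMES = { "Number", "Punctuation", "Identifier" };

    private static final Pattern[] RULES = {
        // Number: must not start directly after a digit.
        Pattern.compile("(?<![0-9])-?[0-9]+"),
        // Punctuation: no restriction.
        Pattern.compile("[-.]"),
        // Identifier: must not start directly after an ASCII letter (assumption).
        Pattern.compile("(?<![a-zA-Z])\\.?[a-zA-Z]+"),
    };

    // Whitespace and newlines are skipped, as in the grammar.
    private static final Pattern SKIP = Pattern.compile("[ \\t\\r\\n]+");

    public static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            Matcher ws = SKIP.matcher(input).region(pos, input.length());
            if (ws.lookingAt()) {
                pos = ws.end();
                continue;
            }
            int bestLen = 0;
            int bestRule = -1;
            for (int i = 0; i < RULES.length; i++) {
                Matcher m = RULES[i].matcher(input)
                        .region(pos, input.length())
                        // Transparent bounds let the lookbehind see the text before pos.
                        .useTransparentBounds(true);
                // Strictly greater: the earlier rule wins when lengths are equal.
                if (m.lookingAt() && m.end() - pos > bestLen) {
                    bestLen = m.end() - pos;
                    bestRule = i;
                }
            }
            if (bestRule < 0) {
                throw new IllegalArgumentException("No rule matches at offset " + pos);
            }
            tokens.add(NAMES[bestRule] + ":" + input.substring(pos, pos + bestLen));
            pos += bestLen;
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("1-2"));     // [Number:1, Punctuation:-, Number:2]
        System.out.println(tokenize("1 -2"));    // [Number:1, Number:-2]
        System.out.println(tokenize("foo.bar")); // [Identifier:foo, Punctuation:., Identifier:bar]
    }
}
```

The token sequences agree with the ANTLR output listed above for the same inputs, which suggests the predicates behave like a one-character negative lookbehind on each rule.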