ANTLR4: problems parsing symbols like minus and dot in both numbers and strings


I'm using ANTLR4 and trying to parse '.' in both strings and numbers. I want to parse the following two statements, but I can only get one or the other to parse, never both:

str = "sentence. more.";
num = 5.5;

I'm using the following grammar file:

grammar TestGrammar;

// Parser rules
rule : statement*;

statement : assignmentStatement;

assignmentStatement : identifier '=' numberLiteral ';' 
                    | identifier '=' stringLiteral ';'
                    ;

numberLiteral : MINUS? NUM* DOT? NUM+;
stringLiteral : '"' string '"';

string : (SYMBOL | CHAR | NUM)+;
identifier : UNDERSCORE? (CHAR | NUM)+;

// Lexer rules

WS : (' ' | '\t' | '\r' | '\n')+ -> skip;

CHAR : [a-zA-Z];
NUM : [0-9];
SYMBOL : [<>#=&|!(){}.\-:;];

MINUS : '-';
UNDERSCORE : '_';
DOT : '.';

The DOT and SYMBOL lexer rules overlap. As written, the grammar can parse the string but not the number. If I reorder the rules so that DOT comes before SYMBOL, it parses the number but not the string.

I tried using '.'? directly in the numberLiteral rule and removing the DOT rule, but that still caused problems when matching strings. I've also tried changing the SYMBOL rule to something like the following:

SYMBOL : [<>#=&|!(){}DOT\-:;];

Maybe I have the syntax wrong, but with that change the string no longer parses correctly. How can I change my grammar file so that it parses both statements? I had similar problems with MINUS. Thanks!


Solution

  • You'll want to make number and string literals tokens, that is, lexer rules. If you want to name the sub-parts, you can use fragments, but understand that you'll only get STRING or NUMBER tokens. No tokens are generated for fragment matches, and tokens do not have other tokens as components; they are leaf nodes of the parse tree.

    Try this:

    grammar TestGrammar;
    
    // Parser rules
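    // Named 'program' rather than 'rule': the ANTLR documentation advises
    // against using 'rule' as a rule name.
    program: statement*;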
    
    statement: assignmentStatement;
    
    assignmentStatement
        : IDENTIFIER '=' NUMBER ';'
        | IDENTIFIER '=' STRING ';'
        ;
    
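    // Lexer rules
    // NUMBER comes before IDENTIFIER so that an input such as 5, which both
    // rules match with the same length, becomes a NUMBER.
    NUMBER: MINUS? NUM* DOT? NUM+;
    STRING: '"' (SYMBOL | CHAR | NUM | SPACE)+ '"';
    
    // The old parser rule is no longer needed:
    //string: (SYMBOL | CHAR | NUM)+;
    
    IDENTIFIER: UNDERSCORE? (CHAR | NUM)+;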
    
    fragment CHAR:   [a-zA-Z];
    fragment NUM:    [0-9];
    fragment SYMBOL: [<>#=&|!(){}.\-:;];
    fragment SPACE: ' ';
    
    fragment MINUS:      '-';
    fragment UNDERSCORE: '_';
    fragment DOT:        '.';
    
    WS: [ \t\r\n]+ -> skip;
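
To see why the original grammar could never handle both statements, it can help to imitate the lexer's choice outside ANTLR. The Python sketch below is only a toy model, not ANTLR itself: it assumes the usual rule that the longest match wins and that, on a tie, the rule listed first wins, and it approximates each lexer rule with a regular expression. The names EQ, SEMI and QUOTE are hypothetical stand-ins for the implicit tokens created for '=', ';' and '"' in the parser rules, which are assumed here to take priority over the explicit lexer rules.

```python
import re

def tokenize(rules, text, skip=("WS",)):
    """Toy lexer: the longest match wins; on a tie, the rule listed first wins."""
    compiled = [(name, re.compile(pattern)) for name, pattern in rules]
    tokens, pos = [], 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, regex in compiled:
            m = regex.match(text, pos)
            # Only a strictly longer match replaces the current best,
            # so earlier rules win ties.
            if m and m.end() - pos > best_len:
                best_name, best_len = name, m.end() - pos
        if best_name is None:
            raise ValueError(f"no rule matches {text[pos]!r} at offset {pos}")
        if best_name not in skip:
            tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens

# EQ, SEMI and QUOTE are hypothetical names for the implicit literal tokens.
# Rule order mirrors the question's grammar: SYMBOL precedes MINUS and DOT.
ORIGINAL_RULES = [
    ("EQ", r"="), ("SEMI", r";"), ("QUOTE", r'"'),
    ("WS", r"[ \t\r\n]+"),
    ("CHAR", r"[a-zA-Z]"),
    ("NUM", r"[0-9]"),
    ("SYMBOL", r"[<>#=&|!(){}.\-:;]"),
    ("MINUS", r"-"),
    ("UNDERSCORE", r"_"),
    ("DOT", r"\."),
]

# Rule order mirrors the answer's grammar; fragments are folded into the
# regular expressions because they never produce tokens of their own.
FIXED_RULES = [
    ("EQ", r"="), ("SEMI", r";"),
    ("NUMBER", r"-?[0-9]*\.?[0-9]+"),
    ("STRING", r'"[<>#=&|!(){}.\-:;a-zA-Z0-9 ]+"'),
    ("IDENTIFIER", r"_?[a-zA-Z0-9]+"),
    ("WS", r"[ \t\r\n]+"),
]

if __name__ == "__main__":
    for line in ('num = 5.5;', 'str = "sentence. more.";'):
        print(tokenize(ORIGINAL_RULES, line))
        print(tokenize(FIXED_RULES, line))
```

In this model the original rules turn every '.' and '-' into SYMBOL, because SYMBOL, DOT and MINUS all match a single character and SYMBOL is listed first. The parser rule numberLiteral therefore never sees a DOT or MINUS token. With the rules from the answer, the whole number and the whole quoted string each arrive as a single token, so the overlap between individual characters no longer matters.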