java parsing visual-studio-code antlr antlr4

Antlr rule for Digit is not recognizing digits


I'm trying to extend an existing grammar using ANTLR4. Besides other rules, the .g4 file defines the following:

Digit
 :  ZeroDigit
     | NonZeroDigit
     ;

NonZeroDigit
            :  NonZeroOctDigit
                | '8'
                | '9'
                ;

NonZeroOctDigit
               :  '1'
                   | '2'
                   | '3'
                   | '4'
                   | '5'
                   | '6'
                   | '7'
                   ;

OctDigit
        :  ZeroDigit
            | NonZeroOctDigit
            ;

ZeroDigit
         :  '0' ;


SP
  :  ( WHITESPACE )+ ;

On top of that (not only as a figure of speech), I added the following rules, which are supposed to make use of these existing rules:

ttQL_Query
     : ttQL_TimeClause SP;

ttQL_TimeClause
     : FROM SP? ttQL_DateTime SP? TO SP? ttQL_DateTime; 

ttQL_DateTime
    : ttQL_Date ('T' ttQL_Time ttQL_Timezone)?;

ttQL_Timezone: 'Z' | ( '+' | '-' ) ttQL_Hour ':' ttQL_Minute; 

ttQL_Date: ttQL_Year '-' ttQL_Month '-' ttQL_Day;
ttQL_Time: ttQL_Hour (':' ttQL_Minute (':' ttQL_Second (ttQL_Millisecond)?)?)?;

ttQL_Year: Digit Digit Digit Digit;
ttQL_Month: Digit Digit;
ttQL_Day: Digit Digit;

ttQL_Hour: Digit Digit ;
ttQL_Minute: Digit Digit ;
ttQL_Second: Digit Digit ;
ttQL_Millisecond: '.' ( Digit )+;


FROM : ( 'F' | 'f' ) ( 'R' | 'r' ) ( 'O' | 'o' ) ( 'M' | 'm' ) ;
TO : ( 'T' | 't' ) ( 'O' | 'o' ) ;

This is supposed to be an extension of the openCypher query language (the grammar can be found here: http://opencypher.org/resources/), but I can't get it to work. It is supposed to prefix a Cypher query. The rule for that is simple:

ttQL
     : SP? ttQL_Query SP? oC_Cypher ;

All the other existing rules, as well as the ones I showed at the beginning, are used in oC_Cypher. I put all my rules at the top of the ANTLR file, and when I try to parse a query like the following:

FROM 2123-12-13T12:34:39Z TO 2123-12-13T14:34:39.2222Z MATCH (a)-[x]->(b) WHERE a.ping > 22" RETURN a.ping, b"

I get the following error messages from my parser:

line 1:5 mismatched input '2123' expecting Digit
line 1:10 mismatched input '12' expecting Digit
line 1:13 mismatched input '13' expecting Digit
line 1:29 mismatched input '2123' expecting Digit
line 1:34 mismatched input '12' expecting Digit
line 1:37 mismatched input '13' expecting Digit

The weird thing is that when I put my part of the grammar in a new .g4 file and create a parser only for the prefix part FROM 2123-12-13T12:34:39Z TO 2123-12-13T14:34:39.2222Z, everything works like a charm. I'm kind of lost here. I am using VS Code, Java 11, Maven and the ANTLR4 plugin with ANTLR version 4.9.2 and maven-compiler-plugin 3.10.1.

What could be the catch here?


Solution

  • With the help of kaby's answers I was able to solve the problem. I don't know if this is the correct way of handling the issue, but it is sufficient for what I want to achieve. So please be careful with this solution if you have a similar problem and try to apply it.

    As kaby noted, the lexer picks the token that matches the most characters, so I turned the date and time into lexer rules so that the numbers would not be recognized as integers. Here is my working solution:

    ttQL_Query
         : ttQL_TimeClause SP?;
    
    ttQL_TimeClause
         : FROM SP? DATETIME SP? TO SP? DATETIME; 
    
    DATETIME:  DATE ('T' TIME TIMEZONE)?;
    
    TIMEZONE: 'Z' | ( '+' | '-' ) Digit Digit ':' Digit Digit; 
    
    DATE: Digit Digit Digit Digit '-' Digit Digit '-' Digit Digit;
    TIME: Digit Digit (':' Digit Digit (':' Digit Digit ('.' (Digit)+ )?)?)?;
    
    
    FROM : ( 'F' | 'f' ) ( 'R' | 'r' ) ( 'O' | 'o' ) ( 'M' | 'm' ) ;
    TO : ( 'T' | 't' ) ( 'O' | 'o' ) ;
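
The longest-match behaviour described above can be reproduced in a few lines of plain Java. This is only a rough sketch and not how ANTLR is implemented: the three rules are simplified regex stand-ins (the name DecimalInteger is assumed here as the integer rule of the existing grammar), and the class and method names are made up for illustration. At each position every rule is tried, the longest match wins, and on a tie the rule declared first wins.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MaximalMunchDemo {

    // Hypothetical, simplified stand-ins for lexer rules.
    // Insertion order models the order of the rules in the grammar file.
    static final Map<String, Pattern> RULES = new LinkedHashMap<>();
    static {
        RULES.put("Digit", Pattern.compile("[0-9]"));
        RULES.put("DecimalInteger", Pattern.compile("0|[1-9][0-9]*")); // assumed integer rule
        RULES.put("Dash", Pattern.compile("-"));
    }

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            String bestRule = null;
            int bestLen = 0;
            for (Map.Entry<String, Pattern> rule : RULES.entrySet()) {
                Matcher m = rule.getValue().matcher(input);
                m.region(pos, input.length());
                // lookingAt() only accepts a match starting at the current position.
                // The strict '>' keeps the earlier rule when two matches have equal length.
                if (m.lookingAt() && m.end() - pos > bestLen) {
                    bestRule = rule.getKey();
                    bestLen = m.end() - pos;
                }
            }
            if (bestRule == null) {
                throw new IllegalArgumentException("no rule matches at position " + pos);
            }
            tokens.add(bestRule + "(" + input.substring(pos, pos + bestLen) + ")");
            pos += bestLen;
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [DecimalInteger(2123), Dash(-), DecimalInteger(12), Dash(-), DecimalInteger(13)]
        System.out.println(tokenize("2123-12-13"));
    }
}
```

The parser rule ttQL_Year asks for four Digit tokens, but under these assumptions the lexer has already emitted a single integer token for 2123, which is consistent with the "mismatched input '2123' expecting Digit" message. It would also explain why the standalone grammar worked: with no integer rule present, nothing longer than Digit competes for the input.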
    

    EDIT:

    I discovered that my solution contains another pitfall, which I will add here. If you are parsing integers or any other sequence of digits in which two digits can appear next to each other, my TIME rule matches and a TIME token is created, at least if this rule is placed above other rules that could also match. As someone dealing with lexers and parsers for the first time, I found that the most important thing is to be careful about the lexer rules that already exist. As kaby mentioned: take care of the lexer first, and print out the tokens of example input for debugging. In my case, a simple solution was to merge the DATE, TIME and TIMEZONE rules into one more specific rule that does not conflict with the existing lexer rules:

    DATETIME
        :  Digit Digit Digit Digit '-' Digit Digit '-' Digit Digit
           ( 'T' Digit Digit (':' Digit Digit (':' Digit Digit ('.' (Digit)+ )?)?)?
             ( 'Z' | ( '+' | '-' ) Digit Digit ':' Digit Digit )
           )?
        ;
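
To check what the merged rule accepts without regenerating the parser each time, the rule can be transcribed into a regular expression. The following is a rough Java sketch; the class and method names are invented for illustration, and the regex is a hand transcription of the DATETIME rule above, not something ANTLR produces.

```java
import java.util.regex.Pattern;

class DateTimeRuleCheck {

    // Hand-written regex transcription of the merged DATETIME lexer rule.
    // [0-9] is used instead of \d to mirror the Digit rule explicitly.
    static final Pattern DATETIME = Pattern.compile(
            "[0-9]{4}-[0-9]{2}-[0-9]{2}"                          // date part
            + "(T[0-9]{2}(:[0-9]{2}(:[0-9]{2}(\\.[0-9]+)?)?)?"    // optional time part
            + "(Z|[+-][0-9]{2}:[0-9]{2}))?");                     // timezone, required after a time

    static boolean isDateTime(String text) {
        // matches() requires the whole string to fit the pattern
        return DATETIME.matcher(text).matches();
    }

    public static void main(String[] args) {
        String[] samples = {
            "2123-12-13T12:34:39Z",
            "2123-12-13T14:34:39.2222Z",
            "2123-12-13",
            "22",
            "2123-12-13T12:34:39"
        };
        for (String sample : samples) {
            System.out.println(sample + " -> " + isDateTime(sample));
        }
    }
}
```

The two date-times from the question and a bare date are accepted, while a plain two-digit number such as 22 is not, so it stays available for the existing integer rules. Note that, as the rule is written, a time part without a trailing timezone (the last sample) is not a complete DATETIME.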