Search code examples
parsingantlr4lexical-analysis

Antlr Matlab grammar lexing conflict


I've been using the Antlr Matlab grammar from Antlr grammars

I found out I need to implement the ' Matlab operator. It is the complex conjugate transpose operator, used as such

result = input'

I tried a straightforward solution of adding it to unary_expression as an option postfix_expression '\''

However, this failed to parse when multiple of these operators were used on a single line.

Here's a significantly simplified version of the grammar, still exhibiting the exact problem:

grammar Grammar;

unary_expression
   : IDENTIFIER
   | unary_expression '\''
   ;

translation_unit : unary_expression CR ;

STRING_LITERAL : '\'' [a-z]* '\'' ;

IDENTIFIER : [a-zA-Z] ;

CR : [\r\n] + ;

Test cases, being parsed as translation_unit:

"x''\n" //fails getNumberOfSyntaxErrors returns 1
"x'\n" //passes

The failure also prints the message line 1:1 extraneous input '''' expecting CR to stderr.

The failure goes away if I either remove STRING_LITERAL, or change the * to +. Neither is a proper solution of course, as removing it is entirely off the table, and mandating non-empty strings is not quite correct, though I might be able to live with it. Also, forcing non-empty string does nothing to help the real use case, when the input is something like x' + y' instead of using the operator twice.

For some reason removing CR from the grammar and \n from the tests also makes the parsing run without problems, but yet again is not a useable solution.

What can I do to the grammar to make it work correctly? I'm assuming it's a problem with lexing specifically because removing STRING_LITERAL or making it unable to match '' makes it go away.


Solution

  • The lexer can never be made that context aware I think, but I don't know Matlab well enough to be sure. How could you check during tokenisation that these single quotes are operators:

    x' + y';
    

    while these are strings:

    x = 'x' + ' + y';
    

    ?

    Maybe you can do something similar as how in ECMAScript a / can be a division operator or a regex delimiter. In this grammar that is handled by a predicate in the lexer that uses some target code to check this.

    If something like the above is not possible, I see no other way than to "promote" the creation of strings to the parser. That would mean removing STRING_LITERAL and introducing a parser rule that matches something like this:

    string_literal
     : QUOTE ~(QUOTE | CR)* QUOTE
     ;
    
    // Needed to match characters inside strings
    OTHER
     : .
     ;
    

    However, that will fail when a string like 'hi there' is encountered: the space in between hi and there will now be skipped by the WS rule. So WS should also be removed (spaces will then get matched by the OTHER rule). But now (of course) all spaces will litter the token stream and you'll have to account for them in all parser rules (not really a viable solution).

    All in all: I don't see ANTLR as a suitable tool in this case. You might look into parser generators where there is no separation between tokenisation and parsing. Google for "PEG" and/or "scannerless parsing".