
Whitespace handling in grako when regular expressions are involved


I'm trying to write a Grako-flavored EBNF grammar. I noticed that the generated parser does not seem to advance over whitespace or comments when it parses a regular expression.

The documentation says the following on that topic:

Unlike other expressions, this one does not advance over whitespace or comments. For that, place the regexp as the only term in its own rule.

I then created a simple grammar with only one regexp-rule. The regex is also the only term within that rule.

@@eol_comments :: ?/(#[^\r\n]*)|(\/\/[^\r\n]*)/?
@@comments :: ?/\s*\/\*(.|[\r\n])*?\*\//?

Start     = NameList $;
NameList  = { Name } ;
Name      = /[a-zA-Z_][a-zA-Z0-9_]+/ ;

The generated parser fails on the inputs " abc\ndef" and "abc\ndef": the first fails at the very beginning, the second at the first newline, space, or comment.

This only occurs with regular expressions; other rules work fine. For example, if Name is defined like this:

Name      = 'abc' | 'def' ;

Then everything is ok and the above inputs successfully parse.

How can I change the behavior so that the parser advances over whitespace and comments?

Additional Info:

Traces of the above inputs ("abc\ndef" first, then " abc\ndef"):

<Start
<1:1>abc

<NameList<Start
<1:1>abc

<Name<NameList<Start
<1:1>abc

>'abc' /[a-zA-Z_][a-zA-Z0-9_]+/
<1:4>

>Name<NameList<Start
<1:4>

<Name<NameList<Start
<1:4>

!'' /[a-zA-Z_][a-zA-Z0-9_]+/
<1:4>

>NameList<Start
<1:4>

!Start
<1:1>abc

and

<Start
<1:1> abc

<NameList<Start
<1:1> abc

<Name<NameList<Start
<1:1> abc

!'' /[a-zA-Z_][a-zA-Z0-9_]+/
<1:1> abc

>NameList<Start
<1:1> abc

!Start
<1:1> abc

I generated the parser using the following command:

grako --generate-parser --outfile parser.py test.ebnf

I've also tried specifying whitespace with the -w option (/\s+/ and /[ \t\n\r]+/), but that did not change the behavior.

I started the parser using:

python parser.py eztest.txt Start -t

Solution

  • Rule names that start with an uppercase letter are special in Grako. As the documentation explains, they do not advance over whitespace before starting to parse.

    Change the rule names in your grammar so they start with a lowercase letter, and it should be fine.

    Why not leave the choice of camel-case or Python-style rule names to the user?

    • It was a simple, easy-to-implement design choice that allows for great flexibility in the lexical aspects of a language
    • It was expected that Python programmers would be comfortable with Python-style names
    • The tradition in computerized grammars and parsers is to use lower case for rule names
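
To make the effect of the naming rule concrete, here is a small toy model in plain Python. It is not Grako's implementation and does not call Grako at all: `parse_names` is a hypothetical helper that imitates `{ Name } $` using the regular expression from the question, and it assumes that whitespace is skipped before the token only when the rule name starts with a lowercase letter. Comments are not modeled.

```python
import re

# Same pattern as the Name rule in the question's grammar.
NAME = re.compile(r"[a-zA-Z_][a-zA-Z0-9_]+")
WHITESPACE = re.compile(r"\s*")


def parse_names(text, rule_name):
    """Toy model of `{ Name } $` (hypothetical helper, not Grako code).

    Returns (names, at_eof), where at_eof tells whether only
    whitespace is left once the repetition stops.
    """
    # Assumption taken from the answer: rules whose name starts with an
    # uppercase letter do not skip whitespace before they start parsing.
    skip_ws = rule_name[0].islower()
    pos, names = 0, []
    while True:
        start = WHITESPACE.match(text, pos).end() if skip_ws else pos
        m = NAME.match(text, start)
        if m is None:
            break
        names.append(m.group())
        pos = m.end()
    at_eof = WHITESPACE.match(text, pos).end() == len(text)
    return names, at_eof


print(parse_names("abc\ndef", "Name"))   # stops after 'abc', like the first trace
print(parse_names(" abc\ndef", "Name"))  # matches nothing, like the second trace
print(parse_names(" abc\ndef", "name"))  # lowercase rule name: both names parse
```

With the uppercase name, the model stops at the same positions as the traces in the question (after `abc`, or at the leading space). With the lowercase name, both inputs are consumed completely, which is what renaming `Start`, `NameList`, and `Name` to lowercase names should achieve in the real grammar.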