c# grammar abstract-syntax-tree bnf irony

Include whitespaces when parsing with Irony


I am writing a parser using the following library: https://www.nuget.org/packages/Irony

My current goal is to parse a file that contains lines of plain text. Each line starts with either a space or a tab character.

This is what my grammar class looks like:

NonTerminal program = new NonTerminal("program");
NonTerminal textStatement = new NonTerminal("textStatement");
NonTerminal textStatements = new NonTerminal("textStatements");

FreeTextLiteral text = new FreeTextLiteral("text", "\r\n");

KeyTerm whitespace = ToTerm(" ", "whitespace");
KeyTerm tab = ToTerm("\t", "tab");
KeyTerm newline = ToTerm("\n", "newline");

textStatement.Rule = ((whitespace | tab) + text + newline);
textStatements.Rule = MakePlusRule(textStatements, textStatement);

program.Rule = textStatements;
this.Root = program;

And this is the content of a target file (the dashed lines are not part of the file):

----------------------
 test

----------------------

Surprisingly, the thing fails on me with the following message:

Column 1, Line 0:
Syntax error, expected: whitespace, tab

It looks like the grammar skips spaces and tabs by default, so the scanner discards the leading " " and starts parsing at the letter "t". That is fine in most cases, but not here: I'm trying to write a Python-like language, so tracking leading whitespace is important.

I'm not expecting you to write the whole grammar for me, just suggest a generic approach. Any help is appreciated, thanks!

UPD: I ended up overriding two methods like this:

    public override bool IsWhitespaceOrDelimiter(char ch)
    {
        if (ch == ' ' || ch == '\t')
            return false;
        return base.IsWhitespaceOrDelimiter(ch);
    }

    public override void SkipWhitespace(ISourceStream source)
    {
        while (!source.EOF())
        {
            switch (source.PreviewChar)
            {
                // ' ' and '\t' are deliberately not skipped here,
                // so they reach the grammar as tokens:
                //case ' ':
                //case '\t':
                //    break;
                case '\r':
                case '\n':
                case '\v':
                    if (UsesNewLine) return;
                    break;
                default:
                    return;
            }
            source.PreviewPosition++;
        }
    }

Solution

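  • If you want to handle a space as an explicit character in the grammar, override the IsWhitespaceOrDelimiter method and return false for the space character. Do the same for tab and any other character you need to see as a token. The update in the question additionally overrides SkipWhitespace so that the scanner stops consuming spaces and tabs.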
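
The effect described in the question can be modelled without Irony. The sketch below is a plain C# stand-in for a scanner's skip loop, not Irony's actual implementation; SkipDemo, FirstTokenStart and the isSkippable predicate are hypothetical names introduced only for illustration. It shows that when space and tab count as skippable, the first token of " test" starts at the letter "t", and that once they are excluded the space itself is the first character the grammar gets to see.

```csharp
using System;

public static class SkipDemo
{
    // Hypothetical helper: advances past every leading character that the
    // predicate marks as skippable and returns the index where the first
    // token would start.
    public static int FirstTokenStart(string source, Func<char, bool> isSkippable)
    {
        int position = 0;
        while (position < source.Length && isSkippable(source[position]))
            position++;
        return position;
    }

    public static void Main()
    {
        const string line = " test";

        // Model of the default behaviour: spaces and tabs are skipped.
        int defaultStart = FirstTokenStart(line, ch => ch == ' ' || ch == '\t');

        // Model of the overridden behaviour: no character is skipped.
        int overriddenStart = FirstTokenStart(line, ch => false);

        Console.WriteLine($"default: index {defaultStart} ('{line[defaultStart]}')");
        Console.WriteLine($"overridden: index {overriddenStart} ('{line[overriddenStart]}')");
    }
}
```

In the first case the grammar never sees the leading space, which matches the "expected: whitespace, tab" error reported at the start of the line; in the second case the space is available to be matched by the whitespace term.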