context-free-grammar, pegjs

How do I write a grammar for this (negative lookaheads in Peg.js)?


EDIT: There is more information in the related question "Does the Peg.js engine backstep after a lookahead like regexs do?"

I've been learning about interpreters in general, and recently I've been working with peg.js specifically to create a parser from a grammar.

Here's an example of an issue I'm having. The following input contains three "terms" ('abc def', 'ghi', and 'jkl') and two "delimiters" (' . '). How can I write a grammar for it?

abc def . ghi . jkl

I had no problem writing a grammar for this simpler input:

abc . def . ghi

I used this:

expression
    = term ( _ delimiter _ term )*

term "term"
    = [a-z]+

delimiter "delimiter"
    = "."

_ "whitespace"
  = [ \t\n\r]+

However, I have had a lot of trouble doing the same for:

abc def . ghi . jkl

Once the terms and the delimiters share a character (the whitespace), I'm unable to proceed. This, for instance, does not work:

term "term"
    = [a-z| ]+

The problem is that everything I attempt seems to require the lexer, or the pointer (I'm not sure of the correct terminology), to move up to the period before finishing the term. The match then fails, because the parser thinks it has already passed the whitespace it needed for the delimiter.

I'm essentially unable to look ahead and say: this space is in fact the first character of the delimiter, not the last character of the term.

As far as I can tell, the lookahead operators such as '&' only govern whether a match is consumed or not, but still move the pointer to that position.

In fact, I would like to allow both of my delimiter characters (the space and the period) inside my terms, like this:

term1.subterm1a subterm1b . term2 subterm2a.subterm2b
// two terms separated by ' . ' delimiter

How can I accomplish this?


Solution

  • I might be misunderstanding what you're trying to accomplish, but wouldn't something like this work?

    expression
        = terms ( _ delimiter _ terms )*
    
    terms "terms"
        = term ( _ term )*
    
    term "term"
        = [a-z]+
    
    delimiter "delimiter"
        = "."
    
    _ "whitespace"
      = [ \t\n\r]+
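
    If the delimiter characters also need to appear inside the terms themselves (as in the last example of the question), a negative lookahead (`!`) may be a better fit. In PEG.js, `!expression` succeeds only when `expression` does not match at the current position, and it never consumes input, so a term can be defined as "any run of characters that does not start a delimiter". The following is only a minimal sketch, assuming PEG.js 0.10-style syntax; the rule name `separator`, the labels, and the action code are illustrative and not part of the original grammar:

    ```
    expression
        // one term, then any number of (separator, term) pairs;
        // the actions collect the term strings into a flat array
        = head:term tail:( separator t:term { return t; } )*
          { return [head].concat(tail); }

    term "term"
        // consume one character at a time, but only while the input
        // at the current position does NOT begin a separator
        = chars:( !separator c:. { return c; } )+
          { return chars.join(""); }

    separator "separator"
        // a period with whitespace on both sides
        = _ "." _

    _ "whitespace"
        = [ \t\n\r]+
    ```

    With this sketch, a space or a period on its own is treated as part of the term, and only the full whitespace-period-whitespace sequence ends it. Note that `term` accepts any character here, not just `[a-z]`, so `.` in the term rule may need to be replaced with a narrower character class.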