javascript parsing grammar pegjs

Detecting _vars_with_underscores_; why does this not work?


I am trying to write a PEGjs rule to convert

Return _a_b_c_.

to

Return <>a_b_c</>.

My grammar is

root = atoms:atom+
{ return atoms.join(''); }

atom = variable
     / normalText

variable = "_" first:variableSegment rest:$("_" variableSegment)* "_"
{ return '<>' + first + rest + '</>'; }

variableSegment = $[^\n_ ]+

normalText = $[^\n]

This works for

Return _a_b_c_ .

and

Return _a_b_c_

but something is going wrong with the

Return _a_b_c_.

example.

I can't quite understand why this is breaking, and would love an explanation of why it's behaving as it does. (I don't even need a solution to the problem, necessarily; the biggest issue is that my mental model of PEGjs grammars is deficient.)


Solution

  • Rearranging the grammar slightly makes it work:

    root = atoms:atom+
    { return atoms.join(''); }
    
    atom = variable
         / normalText
    
    variable = "_" first:$(variableSegment "_") rest:$(variableSegment "_")*
    { return '<>' + first + rest + '</>'; }
    
    variableSegment = seg:$[^\n_ ]+
    
    normalText = normal:$[^\n]
    

    I'm not sure I understand why, exactly. In this one, the parser gets to the "." and matches it as a "variableSegment", but the "_" that should follow it is missing, so that iteration of the greedy "*" fails. The "*" keeps the iterations that already succeeded, the parser decides it's got a "variable", and then it re-parses the "." as "normalText". (Note that the match now includes the trailing _, which if not desired can be snipped off by a hack in the action, or avoided with the second revision below.)
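    The trailing-underscore snip mentioned above could look roughly like this. It is only a sketch: wrapVariable is a hypothetical plain function standing in for the body of the "variable" action, and it assumes first and rest arrive as strings (which they do here, because $ wraps each group, including the "*").

```javascript
// Hypothetical helper standing in for the body of the `variable` action.
// `first` and `rest` are the strings captured by $(...) in the rule above,
// e.g. "a_" and "b_c_" for the input "_a_b_c_".
function wrapVariable(first, rest) {
  // Drop the closing "_" that the revised rule captures along with the name.
  const name = (first + rest).replace(/_$/, "");
  return "<>" + name + "</>";
}

console.log(wrapVariable("a_", "b_c_")); // <>a_b_c</>
```

    Inside the grammar, the same two statements would go directly in the action braces.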

    In the original version, after failing because of the missing trailing underscore, the very next step the parser takes is back to the leading underscore, opting for the "normal" interpretation.

    I added some action code with console.log() calls to trace the parser behavior.

    Edit: I think the deal is this. In your original version, the parse is failing on a rule that's of the form

    expr1 expr2 expr3 ... exprN

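    The first sub-expression is the literal _. The next is for the first variable segment. The third is for the sequence of variable segments, each preceded by _, and the last is the trailing _. While working through that rule on the problematic input, the third expression greedily consumes "_." as one more segment (a "." is a legal variableSegment character), so the last expression finds no _ left and fails. The others have all succeeded, however, and a PEG parser never goes back to retry a sub-expression that succeeded with a shorter match, so the only place to start over is at the alternative point in the "atom" rule.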

    In the revised version, the iteration of the greedy * that matches "." fails because no _ follows it, and the * discards just that one iteration, ending after "c_". The * is now the last expression in the sequence and has matched successfully, so the rule succeeds.
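    The difference between the two rules can be reproduced outside PEG.js. The following is a minimal sketch using hand-rolled combinators (lit, seq, star and segment are my own illustrative names, not the PEG.js API) that follow the same rules: a sequence fails as soon as one element fails, and a greedy star keeps every iteration that succeeded and is never asked to give one back.

```javascript
// Hand-rolled PEG-style combinators for illustration only; not the PEG.js API.
// A parser is a function (input, pos) -> new position, or -1 on failure.
const lit = (s) => (input, pos) => (input.startsWith(s, pos) ? pos + s.length : -1);

const seq = (...parsers) => (input, pos) => {
  for (const p of parsers) {
    pos = p(input, pos);
    if (pos < 0) return -1; // one failing element fails the whole sequence
  }
  return pos;
};

// Greedy repetition: keeps every iteration that succeeded and discards only
// the iteration that failed. Later failures never make it give one back.
const star = (p) => (input, pos) => {
  for (;;) {
    const next = p(input, pos);
    if (next <= pos) return pos; // failed (or consumed nothing): stop here
    pos = next;
  }
};

// variableSegment = $[^\n_ ]+
const segment = (input, pos) => {
  let end = pos;
  while (end < input.length && !"\n_ ".includes(input[end])) end++;
  return end > pos ? end : -1;
};

// Original rule: "_" variableSegment ("_" variableSegment)* "_"
const original = seq(lit("_"), segment, star(seq(lit("_"), segment)), lit("_"));

// First revision: "_" (variableSegment "_") (variableSegment "_")*
const revised = seq(lit("_"), segment, lit("_"), star(seq(segment, lit("_"))));

const input = "_a_b_c_.";
console.log(original(input, 0)); // -1: the star swallowed "_." so no "_" is left to close
console.log(revised(input, 0));  // 7: the iteration that tried "." failed and was dropped
```

    With "_a_b_c_ ." the original rule returns 7 as well, because the space stops the star one iteration earlier and leaves the closing _ unconsumed.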

    Thus another revision, closer to the original, will also work:

    root = atoms:atom+
    { return atoms.join(''); }
    
    atom = variable
         / normalText
    
    variable = "_" first:variableSegment rest:$("_" variableSegment & "_")* "_"
    { return '<>' + first + rest + '</>'; }
    
    variableSegment = $[^\n_ ]+
    
    normalText = $[^\n]
    

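    Now an iteration of the greedy * group only succeeds if the lookahead & "_" sees an underscore after the segment. On the problematic input the iteration that matches "_." fails that check, so the * stops after "_c" and leaves the final _ for the closing literal.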
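    To check that reading of the final grammar, here is a hand-written JavaScript stand-in for it. This is a sketch rather than what PEG.js generates: segment, variable and convert are names I made up, and the lookahead is simply a peek at the next character that consumes nothing.

```javascript
// Hand-written stand-in for the final grammar; not generated by PEG.js.
const isSegmentChar = (ch) => ch !== undefined && !"\n_ ".includes(ch);

// variableSegment = $[^\n_ ]+  -> end position, or -1 if nothing matched
function segment(input, pos) {
  let end = pos;
  while (isSegmentChar(input[end])) end++;
  return end > pos ? end : -1;
}

// variable = "_" first:variableSegment rest:$("_" variableSegment & "_")* "_"
function variable(input, pos) {
  if (input[pos] !== "_") return null;
  let end = segment(input, pos + 1);
  if (end < 0) return null;
  for (;;) {
    if (input[end] !== "_") break;
    const next = segment(input, end + 1);
    // & "_": peek at the next character without consuming it
    if (next < 0 || input[next] !== "_") break;
    end = next; // keep this iteration
  }
  if (input[end] !== "_") return null; // closing "_"
  return { text: "<>" + input.slice(pos + 1, end) + "</>", end: end + 1 };
}

// root = atom+ ; atom = variable / normalText ; normalText = $[^\n]
// (This sketch just stops at a newline instead of reporting a parse error.)
function convert(input) {
  let out = "";
  let pos = 0;
  while (pos < input.length && input[pos] !== "\n") {
    const v = variable(input, pos);
    if (v) {
      out += v.text;
      pos = v.end;
    } else {
      out += input[pos]; // fall back to normalText, one character at a time
      pos++;
    }
  }
  return out;
}

console.log(convert("Return _a_b_c_.")); // Return <>a_b_c</>.
```

    The two inputs that already worked, "Return _a_b_c_ ." and "Return _a_b_c_", still convert the same way, since in both the loop stops before the last underscore.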