I am trying to write a PEGjs rule to convert
Return _a_b_c_.
to
Return <>a_b_c</>.
My grammar is
root = atoms:atom+
{ return atoms.join(''); }
atom = variable
/ normalText
variable = "_" first:variableSegment rest:$("_" variableSegment)* "_"
{ return '<>' + first + rest + '</>'; }
variableSegment = $[^\n_ ]+
normalText = $[^\n]
This works for
Return _a_b_c_ .
and
Return _a_b_c_
but something is going wrong with the
Return _a_b_c_.
example.
I can't quite understand why this is breaking, and would love an explanation of why it's behaving as it does. (I don't even need a solution to the problem, necessarily; the biggest issue is that my mental model of PEGjs grammars is deficient.)
Rearranging the grammar slightly makes it work:
root = atoms:atom+
{ return atoms.join(''); }
atom = variable
/ normalText
variable = "_" first:$(variableSegment "_") rest:$(variableSegment "_")*
{ return '<>' + first + rest + '</>'; }
variableSegment = seg:$[^\n_ ]+
normalText = normal:$[^\n]
I'm not sure I understand exactly why. In this one, the parser gets to the "." and matches it as a "variableSegment", but the "_" that should follow is missing, so that iteration of the greedy "*" fails and is discarded. The parser keeps the iterations that already matched, decides it's got a "variable", and then re-parses the "." as "normal". (Note that this picks up the trailing _, which if not desired can be snipped off by a hack in the action, or something like that; see below.)
In the original version, after failing because of the missing trailing underscore, the very next step the parser takes is back to the leading underscore, opting for the "normal" interpretation.
I added some action code with console.log() calls to trace the parser behavior.
Edit: I think the deal is this. In your original version, the parse is failing on a rule that's of the form
expr1 expr2 expr3 ... exprN
The first sub-expression is the literal _. The next is for the first variable segment. The third is for the sequence of variable segments, each preceded by _, and the last is the trailing _. On the problematic input, "." is a legal variableSegment character, so the third expression swallows the final "_." as one more iteration, and the last expression then fails because there is no input left. The others have all succeeded, however, and a PEG repetition never gives back an iteration that succeeded, so the only place to start over is at the alternative point in the "atom" rule.
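To make that concrete, here is a rough hand-written JavaScript sketch of how the original "variable" rule behaves. It is not the code PEG.js generates; the helper names (literal, variableSegment, variableOriginal) are made up for illustration, and the only assumption is the usual PEG rule that a repetition keeps every iteration that succeeded.

```javascript
// Hand-rolled sketch of the ORIGINAL `variable` rule. Not PEG.js output;
// all function names here are hypothetical.
// Position helpers return the new position, or -1 on failure.

// Matches the literal `text` at `pos`.
function literal(input, pos, text) {
  return input.startsWith(text, pos) ? pos + text.length : -1;
}

// variableSegment = $[^\n_ ]+
function variableSegment(input, pos) {
  let end = pos;
  while (end < input.length && !"\n_ ".includes(input[end])) end++;
  return end > pos ? end : -1;
}

// variable = "_" first:variableSegment rest:$("_" variableSegment)* "_"
function variableOriginal(input, pos) {
  const open = literal(input, pos, "_");
  if (open < 0) return null;
  const firstEnd = variableSegment(input, open);
  if (firstEnd < 0) return null;
  const first = input.slice(open, firstEnd);

  // Greedy star: take ("_" variableSegment) for as long as an iteration succeeds.
  let restEnd = firstEnd;
  for (;;) {
    const underscore = literal(input, restEnd, "_");
    if (underscore < 0) break;
    const segEnd = variableSegment(input, underscore);
    if (segEnd < 0) break; // a failed iteration is dropped; restEnd is unchanged
    restEnd = segEnd;      // a successful iteration is never given back
  }
  const rest = input.slice(firstEnd, restEnd);

  // Trailing "_": if the star already ate it, the whole rule fails here.
  const close = literal(input, restEnd, "_");
  if (close < 0) return null;
  return { text: "<>" + first + rest + "</>", end: close };
}

console.log(variableOriginal("_a_b_c_.", 0));  // null: the star consumed "_."
console.log(variableOriginal("_a_b_c_ .", 0)); // { text: '<>a_b_c</>', end: 7 }
```

With "_a_b_c_." the loop ends with rest equal to "_b_c_.", so nothing is left for the closing "_". With a space after the last underscore, the final iteration fails on the space and is dropped, which leaves the underscore available.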
In the revised version, the iteration of the greedy * that consumes the "." fails on its own missing _, so the parser unwinds just that one iteration. It then has a successful match of the third expression, so the rule succeeds.
Thus another revision, closer to the original, will also work:
root = atoms:atom+
{ return atoms.join(''); }
atom = variable
/ normalText
variable = "_" first:variableSegment rest:$("_" variableSegment & "_")* "_"
{ return '<>' + first + rest + '</>'; }
variableSegment = $[^\n_ ]+
normalText = $[^\n]
Now an iteration of that greedy * group fails, and is rolled back, whenever the peek forward does not find an _.
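As a rough check of that reasoning, here is a hand-written JavaScript sketch of the lookahead version. It is not what PEG.js generates, and the names (literal, variableSegment, variableWithLookahead, convert) are invented for the illustration; convert is a simplified stand-in for the "root" and "atom" rules.

```javascript
// Hand-rolled sketch of the grammar with the & "_" lookahead. Not PEG.js
// output; all function names here are hypothetical.

// Matches the literal `text` at `pos`; returns the new position or -1.
function literal(input, pos, text) {
  return input.startsWith(text, pos) ? pos + text.length : -1;
}

// variableSegment = $[^\n_ ]+
function variableSegment(input, pos) {
  let end = pos;
  while (end < input.length && !"\n_ ".includes(input[end])) end++;
  return end > pos ? end : -1;
}

// variable = "_" first:variableSegment rest:$("_" variableSegment & "_")* "_"
function variableWithLookahead(input, pos) {
  const open = literal(input, pos, "_");
  if (open < 0) return null;
  const firstEnd = variableSegment(input, open);
  if (firstEnd < 0) return null;
  const first = input.slice(open, firstEnd);

  let restEnd = firstEnd;
  for (;;) {
    const underscore = literal(input, restEnd, "_");
    if (underscore < 0) break;
    const segEnd = variableSegment(input, underscore);
    if (segEnd < 0) break;
    // & "_": peek only. If no "_" follows, this iteration fails and is dropped.
    if (literal(input, segEnd, "_") < 0) break;
    restEnd = segEnd;
  }
  const rest = input.slice(firstEnd, restEnd);

  const close = literal(input, restEnd, "_");
  if (close < 0) return null;
  return { text: "<>" + first + rest + "</>", end: close };
}

// root = atom+ ; atom = variable / normalText ; normalText = $[^\n]
function convert(input) {
  let pos = 0;
  let out = "";
  while (pos < input.length) {
    const v = variableWithLookahead(input, pos);
    if (v) {
      out += v.text;
      pos = v.end;
    } else if (input[pos] !== "\n") {
      out += input[pos++]; // normalText: one non-newline character
    } else {
      return null; // neither alternative matches a newline
    }
  }
  return out.length > 0 ? out : null; // atom+ needs at least one atom
}

console.log(convert("Return _a_b_c_.")); // Return <>a_b_c</>.
```

The "_." iteration is rejected by the peek, so rest stops at "_b_c" and the closing "_" is still there for the last expression to match.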