javascript, peg, pegjs

Why does an expression like `(!"foo" .)*` generate arrays of `[undefined, char]` values in PEG.js?


I'm still pretty new to PEG.js, and I'm guessing this is just a beginner misunderstanding.

In trying to parse something like this:

definitions
    some text

if
    some additonal text
    to parse here    

then
    still more text will
    go here

I can get a grammar to properly read the three sections (to be parsed further later, of course), but it generates the text in an odd format. For instance, in the above, "some text" turns into

[
  [undefined, "s"], [undefined, "o"], [undefined, "m"], [undefined, "e"], [undefined, " "], 
  [undefined, "t"], [undefined, "e"], [undefined, "x"], [undefined, "t"]
]

I can easily enough convert this to a plain string, but I'm wondering what I'm doing to give it that awful format. This is my grammar so far:

{
  const combine = (xs) => xs .map (x => x[1]) .join('')
}

MainObject
  = _ defs:DefSection _ condition:CondSection _ consequent: ConsequentSection
    {return {defs, condition, consequent}}

DefSection = _ "definitions"i _ defs:(!"\nif" .)+
  {return defs}

CondSection = _ "if"i _ cond:(!"\nthen" .)+
  {return combine (cond)}

ConsequentSection = _ "then"i _ cons:.*
  {return cons .join ('')} 

_ "whitespace"
  = [ \t\n\r]*

I can fix it by replacing {return defs} with {return combine(defs)} as in the other sections.
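
For reference, here is that conversion in isolation, as a small plain-JavaScript sketch. It runs outside PEG.js; the `parsed` array is just the sample output from above pasted in by hand.

```javascript
// Same helper as in the grammar's initializer block:
// keep the second element of each [undefined, char] pair and join the characters.
const combine = (xs) => xs.map((x) => x[1]).join("");

// Sample value copied from the parser output shown earlier.
const parsed = [
  [undefined, "s"], [undefined, "o"], [undefined, "m"], [undefined, "e"], [undefined, " "],
  [undefined, "t"], [undefined, "e"], [undefined, "x"], [undefined, "t"],
];

console.log(combine(parsed)); // prints: some text
```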

My main question is simply: why does it generate that output, and is there a simpler way to fix it?


More generally, as I'm still pretty new to PEG.js, I would love to know if there is a better way to write this grammar. Expressions like (!"\nif" .)+ seem fairly sketchy.


Solution

    1. A negative lookahead such as !Rule consumes no input. It returns undefined when Rule does not match and fails when Rule matches.
    2. The dot . matches any single character (it fails only at the end of the input).
    3. A sequence Rule1 Rule2 ... creates a list with the result of each rule.
    4. A repetition Rule+ or Rule* matches Rule as many times as possible and creates a list of the results. (+ fails if the first attempt to match Rule fails.)
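
    The four rules above can be sketched with a few toy combinators in plain JavaScript. This is only an illustration of the semantics, not how PEG.js is implemented; the names lit, not, any, seq and plus are made up for this example.

```javascript
// Toy PEG combinators (illustrative only; not PEG.js internals).
// A parser takes (input, pos) and returns { value, pos } on success, or null on failure.

// A literal string such as "\nif".
const lit = (s) => (input, pos) =>
  input.startsWith(s, pos) ? { value: s, pos: pos + s.length } : null;

// Rule 1: negative lookahead consumes nothing and yields undefined.
const not = (p) => (input, pos) =>
  p(input, pos) === null ? { value: undefined, pos } : null;

// Rule 2: the dot matches any single character.
const any = (input, pos) =>
  pos < input.length ? { value: input[pos], pos: pos + 1 } : null;

// Rule 3: a sequence collects each sub-result into a list.
const seq = (...ps) => (input, pos) => {
  const values = [];
  for (const p of ps) {
    const r = p(input, pos);
    if (r === null) return null;
    values.push(r.value);
    pos = r.pos;
  }
  return { value: values, pos };
};

// Rule 4: one-or-more repetition collects every match into a list.
const plus = (p) => (input, pos) => {
  const values = [];
  for (let r = p(input, pos); r !== null; r = p(input, pos)) {
    values.push(r.value);
    pos = r.pos;
  }
  return values.length > 0 ? { value: values, pos } : null;
};

// Equivalent of (!"\nif" .)+
const untilIf = plus(seq(not(lit("\nif")), any));
const result = untilIf("some\nif", 0);
console.log(result.value); // four [undefined, char] pairs, one per character of "some"
```

    Each pair comes from the sequence (rule 3): its first slot is the lookahead's undefined (rule 1) and its second slot is the character matched by the dot (rule 2).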

    Your result is therefore

    [                      // the whole list is (!"\nif" .)+, one entry per match of (!"\nif" .)
      [
        undefined,         // first !"\nif"
        "s"                // first .
      ],                   // first (!"\nif" .)
      [undefined, "o"],    // second (!"\nif" .)
      [undefined, "m"], [undefined, "e"], [undefined, " "],
      [undefined, "t"], [undefined, "e"], [undefined, "x"], [undefined, "t"]
    ]
    

    What you seem to want is the matched text itself. You can use the $ operator for this: $Rule returns the portion of the input that Rule matched instead of Rule's match result.
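
    To make the difference concrete, here is a rough hand-written equivalent of (!"\nif" .)+ in plain JavaScript that returns both views of the same match. matchUntil is a hypothetical helper written for this answer, not a PEG.js function.

```javascript
// Hand-rolled stand-in for (!"\nif" .)+ that keeps both views of the match.
// matchUntil is a hypothetical helper, not part of PEG.js.
function matchUntil(input, start, stop) {
  const pairs = [];
  let pos = start;
  // Keep consuming single characters while the stop string is not directly ahead.
  while (pos < input.length && !input.startsWith(stop, pos)) {
    pairs.push([undefined, input[pos]]); // what the plain expression yields per step
    pos += 1;
  }
  if (pairs.length === 0) return null; // "+" needs at least one match
  return {
    structured: pairs,             // like the result without $
    text: input.slice(start, pos), // like the result with $: the consumed input
  };
}

const m = matchUntil("some text\n\nif", 0, "\nif");
console.log(m.structured.length);  // one pair per consumed character
console.log(JSON.stringify(m.text.trim()));
```

    The text view still includes the trailing newline before "\nif", which is why the grammar calls trim() on it.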

    MainObject
      = _ defs:DefSection _ condition:CondSection _ consequent: ConsequentSection
        {return {defs, condition, consequent}}
    
    DefSection = _ "definitions"i _ defs:$(!"\nif" .)+
      {return defs.trim()}
    
    CondSection = _ "if"i _ cond:$(!"\nthen" .)+
      {return cond.trim()}
    
    ConsequentSection = _ "then"i _ cons:$(.*)
      {return cons.trim()} 
    
    _ "whitespace"
      = [ \t\n\r]*
    

    This will produce

    {
       "defs": "some text",
       "condition": "some additonal text\n    to parse here",
       "consequent": "still more text will\n    go here"
    }