
Peg parser - support for escape characters


I'm working on a PEG parser. Among other structures, it needs to parse a tag directive. A tag can contain any character. If you want the tag to include a closing curly brace } you can escape it with a backslash. If you need a literal backslash, that should also be escaped. I tried to implement this, inspired by the PEG grammar for JSON: https://github.com/pegjs/pegjs/blob/master/examples/json.pegjs

There are two problems:

  • an escaped backslash results in two backslash characters instead of one. Example input:
{ some characters but escape with a \\ }
  • the parser breaks on an escaped curly brace \}. Example input:
{ some characters but escape \} with a \\ }

The relevant grammar is:

Tag
  = "{" _ tagContent:$(TagChar+) _ "}" {
  return { type: "tag", content: tagContent }
}

TagChar
  = [^\}\r\n]
  / Escape
    sequence:(
        "\\" { return {type: "char", char: "\\"}; }
      / "}" { return {type: "char", char: "\x7d"}; }
    )
    { return sequence; }
    
_ "whitespace"
  = [ \t\n\r]*
  
Escape
  = "\\"

You can easily test the grammar and the example inputs in the online PegJS sandbox: https://pegjs.org/online

I hope somebody has an idea to resolve this.


Solution

  • These errors are both basically typos.

    The first problem is the character class in your rule for tag characters. In a character class, \ continues to be an escape character, so [^\}\r\n] matches any character other than } (written with an unnecessary backslash escape), carriage return or newline. A backslash is itself such a character, so it's matched by the character class, and the Escape alternative is never tried.

    Since your pattern for tag characters doesn't succeed in recognising \ as an Escape, the tag { \\ } is parsed as four characters (space, backslash, backslash, space) and the tag { \} } is parsed as terminating on the first }, creating a syntax error.

    So you should change the character class to [^}\\\r\n]. (I put the closing brace first in order to make it easier to read the falling timber. The order is irrelevant.)
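
The effect of that change can be checked outside the parser. The sketch below uses JavaScript regular expressions as a stand-in, on the assumption that their character classes treat \} and \\ the same way PEG.js classes do for this case; the names original and corrected are made up for the illustration.

```javascript
// Stand-in for the two PEG.js character classes (assumption: JavaScript
// regex classes handle `\}` and `\\` the same way for this purpose).
const original = /^[^\}\r\n]$/;   // `\}` is just `}`, so a backslash is NOT excluded
const corrected = /^[^}\\\r\n]$/; // `\\` excludes a literal backslash

const backslash = "\\"; // a single backslash character

console.log(original.test(backslash));  // true: the class swallows the backslash
console.log(corrected.test(backslash)); // false: the backslash is left for Escape
console.log(corrected.test("}"));       // false: the brace is still excluded
console.log(corrected.test("a"));       // true: ordinary characters still match
```

With the original class the backslash is consumed as an ordinary tag character, which is why the Escape alternative never gets a chance to run.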

    Once you do that, you'll find that the parser still returns the string with the backslashes intact. That's because of the $ in your Tag pattern: "{" _ tagContent:$(TagChar+) _ "}". According to the documentation, the meaning of the $ operator is: (emphasis added)

    $ expression

    Try to match the expression. If the match succeeds, return the matched text instead of the match result.
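
To make the difference between "matched text" and "match result" concrete, here is a small hand-written JavaScript sketch. It is not PEG.js output and not the generated parser; parseTagContent is a hypothetical helper that walks a tag by hand, and for brevity it ignores the optional whitespace handled by the _ rule. It returns both the raw slice of input (what $ would hand back) and the values the per-character actions would produce.

```javascript
// Hypothetical stand-in for the Tag rule, written by hand for illustration.
// matchedText: the raw input between the braces (what `$` returns).
// matchResult: the unescaped characters produced one by one (what actions return).
function parseTagContent(input) {
  if (input[0] !== "{") throw new SyntaxError("expected '{'");
  let i = 1;
  const results = [];
  while (i < input.length && input[i] !== "}") {
    const ch = input[i];
    if (ch === "\r" || ch === "\n") throw new SyntaxError("line break in tag");
    if (ch === "\\") {
      // Escape: only a backslash or a closing brace may follow.
      const next = input[i + 1];
      if (next !== "\\" && next !== "}") throw new SyntaxError("bad escape");
      results.push(next); // keep the escaped character, drop the backslash
      i += 2;
    } else {
      results.push(ch);   // ordinary tag character
      i += 1;
    }
  }
  if (input[i] !== "}") throw new SyntaxError("expected '}'");
  return { matchedText: input.slice(1, i), matchResult: results.join("") };
}

// The input here is the text {a \} b \\ c}
const parsed = parseTagContent("{a \\} b \\\\ c}");
console.log(parsed.matchedText); // a \} b \\ c   (backslashes intact)
console.log(parsed.matchResult); // a } b \ c     (escapes resolved)
```

The raw slice keeps every backslash, while the joined per-character results do not, which mirrors the behaviour described for $ in the quoted documentation.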