Parboiled2 grammar for parsing escaped CSV line


I am trying to parse a single line of delimiter-separated strings into a sequence of those strings. A field may contain any character; if it contains a delimiter, it must be wrapped in double quotes. To include a double quote inside such a quoted field, the quote is escaped with a backslash (\").

I used this as a starting point: https://github.com/sirthias/parboiled2/blob/695ee6603359cfcb97734edf6dd1d27383c48727/examples/src/main/scala/org/parboiled2/examples/CsvParser.scala

My grammar looks like this:

class CsvParser(val input: ParserInput, val delimiter: String = ",") extends Parser {
  def line: Rule1[Seq[String]] = rule {record ~ EOI}
  def record = rule(oneOrMore(field).separatedBy(delimiter))

  def QUOTE = "\""
  def ESCAPED_QUOTE = "\\\""
  def DELIMITER_QUOTE = delimiter+"\""
  def WS = " \t".replace(delimiter, "")

  def field = rule{whiteSpace ~ ((QUOTE ~ escapedField ~ QUOTE) | unquotedField) ~ whiteSpace}
  def escapedField = rule { capture(zeroOrMore(noneOf(QUOTE) | ESCAPED_QUOTE)) ~> (_.replace(ESCAPED_QUOTE, QUOTE))  } 
  def unquotedField = rule { capture(zeroOrMore(noneOf(DELIMITER_QUOTE))) }
  def whiteSpace = rule(zeroOrMore(anyOf(WS)))
}

When I call it with the input "quote\"key",1,2 I get the error: Invalid input 'k', expected whiteSpace, ',' or 'EOI' (line 1, column 9)

What am I doing wrong? How would I debug this? (And as a bonus question: How would I extend the grammar to allow the delimiter to be multiple chars like ##?)

Thank you!


Solution

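  • Parboiled2 implements PEG ordered choice: in a | b, the alternatives are tried left to right, and once a matches, b is never tried, even if the rest of the rule fails later. There is no backtracking into an alternative that has already succeeded.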

    In this particular case

    def escapedField = rule { capture(zeroOrMore(noneOf(QUOTE) | ESCAPED_QUOTE)) ~> (_.replace(ESCAPED_QUOTE, QUOTE))  } 
    

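    noneOf(QUOTE) is tried first and matches the \ of \" on its own, so ESCAPED_QUOTE never gets a chance to match the full \". The repetition then stops at the ", which is taken as the closing quote, and the parser fails on the k that follows.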

    The error is solved by trying ESCAPED_QUOTE first inside the repetition:

    def escapedField = rule { capture(zeroOrMore(ESCAPED_QUOTE | noneOf(QUOTE))) ~> (_.replace(ESCAPED_QUOTE, QUOTE))  }
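
The effect of alternative order can be reproduced without parboiled2. The sketch below is a hand-rolled model of PEG matching, not the parboiled2 API: the names Matcher, lit, choice, consumed and OrderedChoiceDemo are made up for illustration, and noneOf and zeroOrMore only imitate the behaviour of the library rules with the same names. It compares zeroOrMore(noneOf(QUOTE) | ESCAPED_QUOTE) with the same repetition after swapping the two alternatives, on the text that follows the opening quote in the failing input.

```scala
// A hand-rolled model of PEG matching for illustration only.
// This is NOT the parboiled2 API; all names here are hypothetical.
object OrderedChoiceDemo {
  // A matcher takes the input and a start position and returns the
  // position just after the match, or None if it does not match there.
  type Matcher = (String, Int) => Option[Int]

  // Matches the literal string `s`.
  def lit(s: String): Matcher =
    (in, pos) => if (in.startsWith(s, pos)) Some(pos + s.length) else None

  // Matches any single character that is not contained in `chars`.
  def noneOf(chars: String): Matcher =
    (in, pos) =>
      if (pos < in.length && chars.indexOf(in.charAt(pos)) < 0) Some(pos + 1)
      else None

  // PEG ordered choice: try `a`; only if `a` fails, try `b`.
  def choice(a: Matcher, b: Matcher): Matcher =
    (in, pos) => a(in, pos).orElse(b(in, pos))

  // Greedy repetition: apply `m` until it fails, then succeed.
  def zeroOrMore(m: Matcher): Matcher = (in, pos) => {
    var p = pos
    var next = m(in, p)
    while (next.isDefined) {
      p = next.get
      next = m(in, p)
    }
    Some(p)
  }

  val Quote = "\""
  val EscapedQuote = "\\\""

  // Order used in the question: any non-quote character first.
  val original: Matcher = zeroOrMore(choice(noneOf(Quote), lit(EscapedQuote)))
  // Swapped order: the escaped quote gets the first chance to match.
  val reordered: Matcher = zeroOrMore(choice(lit(EscapedQuote), noneOf(Quote)))

  // Returns the prefix of `in` that `m` consumes when started at position 0.
  def consumed(m: Matcher, in: String): String =
    m(in, 0).map(end => in.substring(0, end)).getOrElse("")

  def main(args: Array[String]): Unit = {
    // The failing input without its opening quote: quote\"key",1,2
    val body = "quote\\\"key\",1,2"
    println(consumed(original, body))  // prints: quote\
    println(consumed(reordered, body)) // prints: quote\"key
  }
}
```

With the original order the repetition stops right after the backslash, so the next " is read as the end of the field. With the escaped quote tried first, the whole field body is consumed and the following " really is the closing quote.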