
Use Scala parser combinators to read multivalue field into list of alternate values


Problem

I want to parse a line like this:

fieldName: value1|value2 anotherFieldName: value3 valueWithoutFieldName

into

List(Some("fieldName") ~ List("value1", "value2"), Some("anotherFieldName") ~ List("value3"), None~List("valueWithoutFieldName"))

Alternative field values are separated by a pipe (|). The field name is optional; a field without a name should be parsed as None (see valueWithoutFieldName).

My current (not working) solution

This is what I have so far:

val parser: Parser[ParsedQuery] = {
  phrase(rep(opt(fieldNameTerm) ~ (multipleValueTerm | singleValueTerm))) ^^ {
    case termSequence =>
      // operate on List[Option[String] ~ List[String]]
  }
}

val fieldNameTerm: Parser[String] = {
  ("\\w+".r <~ ":(?=\\S)".r) ^^ {
    case fieldName => fieldName
  }
}

val multipleValueTerm = rep1((singleValueTerm <~ alternativeValueTerm) | (alternativeValueTerm ~> singleValueTerm))

val alternativeValueTerm: Parser[String] = {
  // matches '|'
  ("\\|".r) ^^ {
    case token => token
  }
}

val singleValueTerm: Parser[String] = {
  // all non-whitespace characters except '|'
  ("[\\S&&[^\\|]]+".r) ^^ {
    case token => token
  }
}

Unfortunately, my code does not parse the last alternative value (the one after the final pipe) correctly; it treats it as the value of a new nameless field. For instance:

The following string:

"abc:111|222|333|444 cde:555"

is parsed into:

List((Some(abc)~List(111, 222, 333)), (None~444), (Some(cde)~555))

while I'd like it to be:

List((Some(abc)~List(111, 222, 333, 444)), (Some(cde)~555))

My suspicions

I think that the problem lies in the definition of multipleValueTerm:

rep1((singleValueTerm <~ alternativeValueTerm) | (alternativeValueTerm ~> singleValueTerm))

Its second part is probably not interpreted correctly, but I have no idea why.

Shouldn't the <~ in the first part of multipleValueTerm leave the pipe (the value separator) in the input, so that the second part of the expression, (alternativeValueTerm ~> singleValueTerm), can parse it successfully?


Solution

  • Let's look at what's happening. We want to parse 111|222|333|444 with multipleValueTerm.

    111| fits (singleValueTerm <~ alternativeValueTerm). <~ consumes the | but discards its result, so we keep only the 111.

    So we have 222|333|444 left.

    Analogously, 222| and 333| are taken, so we are left with 444. But 444 fits neither (singleValueTerm <~ alternativeValueTerm), because no | follows it, nor (alternativeValueTerm ~> singleValueTerm), because the preceding | has already been consumed. So multipleValueTerm stops there, and 444 is then parsed as a new value without a field name.
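To make the trace above concrete, here is a minimal sketch that runs only multipleValueTerm on the example input and reports what it consumed and what it left behind. The three parsers are copied from the question; the wrapper object OriginalParserTrace and the helper trace are hypothetical names introduced for illustration, and the scala-parser-combinators library is assumed to be on the classpath.

```scala
import scala.util.parsing.combinator.RegexParsers

// Hypothetical wrapper object; the three parsers are copied from the question.
object OriginalParserTrace extends RegexParsers {
  val singleValueTerm: Parser[String] = "[\\S&&[^\\|]]+".r
  val alternativeValueTerm: Parser[String] = "\\|".r
  val multipleValueTerm: Parser[List[String]] =
    rep1((singleValueTerm <~ alternativeValueTerm) | (alternativeValueTerm ~> singleValueTerm))

  // Hypothetical helper: returns the parsed values and the unconsumed rest of the input.
  def trace(input: String): (List[String], String) =
    parse(multipleValueTerm, input) match {
      case Success(values, rest) =>
        val leftover = rest.source.subSequence(rest.offset, rest.source.length()).toString
        (values, leftover)
      case other => sys.error(other.toString)
    }
}
```

Calling OriginalParserTrace.trace("111|222|333|444") should return (List("111", "222", "333"), "444"): each | was consumed together with the value before it, so nothing is left for 444 to attach to.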

    I would improve your parser this way:

    val seperator = "|"
    
    lazy val parser: Parser[List[(Option[String] ~ List[String])]] = rep(termParser)
    
    lazy val termParser: Parser[(Option[String] ~ List[String])] = opt(fieldNameTerm) ~ valueParser
    
    lazy val fieldNameTerm: Parser[String] = ("\\w+".r <~ ":(?=\\S)".r)
    
    lazy val valueParser: Parser[List[String]] = rep1sep(singleValueTerm, seperator)
    
    lazy val singleValueTerm: Parser[String] = ("[\\S&&[^\\|]]+".r)
    

    There is no need for the identity mappings ^^ { case x => x }, so I removed them. I also treat single and multiple values the same way: the result is always a List with one or more elements. rep1sep (like repsep) is convenient for dealing with separators.
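For reference, the improved parsers can be wrapped in an object and run against the input from the question. This is only a sketch: the object name QueryParser and the helper parseQuery are assumptions made for illustration, the result is converted to plain tuples for easier comparison, and the scala-parser-combinators library is assumed to be available.

```scala
import scala.util.parsing.combinator.RegexParsers

// Hypothetical wrapper object around the parsers shown above.
object QueryParser extends RegexParsers {
  val seperator = "|"

  lazy val parser: Parser[List[(Option[String] ~ List[String])]] = rep(termParser)

  lazy val termParser: Parser[(Option[String] ~ List[String])] = opt(fieldNameTerm) ~ valueParser

  // A word followed by ':' that is directly followed by a non-whitespace character.
  lazy val fieldNameTerm: Parser[String] = "\\w+".r <~ ":(?=\\S)".r

  // One or more values separated by '|'.
  lazy val valueParser: Parser[List[String]] = rep1sep(singleValueTerm, seperator)

  // All non-whitespace characters except '|'.
  lazy val singleValueTerm: Parser[String] = "[\\S&&[^\\|]]+".r

  // Hypothetical helper: parses the whole input and unpacks each `~` into a tuple.
  def parseQuery(input: String): List[(Option[String], List[String])] =
    parseAll(parser, input) match {
      case Success(terms, _) => terms.map { case name ~ values => (name, values) }
      case other             => sys.error(other.toString)
    }
}
```

With this sketch, QueryParser.parseQuery("abc:111|222|333|444 cde:555") should yield List((Some("abc"), List("111", "222", "333", "444")), (Some("cde"), List("555"))), and a trailing bare word such as xyz ends up as (None, List("xyz")).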

    rep1sep(singleValueTerm, seperator) could equivalently be expressed as singleValueTerm ~ rep(seperator ~> singleValueTerm), with the resulting head and tail combined into one list.
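The equivalence can be checked with a small sketch that defines both variants side by side. The hand-written form produces a String ~ List[String], so it needs a ^^ mapping to prepend the first value to the rest. The object name Rep1SepEquivalence, the parser names viaRep1sep and viaRep, and the helper values are hypothetical and exist only for this illustration.

```scala
import scala.util.parsing.combinator.RegexParsers

// Hypothetical object comparing rep1sep with its hand-written expansion.
object Rep1SepEquivalence extends RegexParsers {
  val seperator = "|"
  val singleValueTerm: Parser[String] = "[\\S&&[^\\|]]+".r

  // The version used in the answer.
  val viaRep1sep: Parser[List[String]] = rep1sep(singleValueTerm, seperator)

  // First value, then zero or more (separator, value) pairs; ~> drops the separator.
  val viaRep: Parser[List[String]] =
    singleValueTerm ~ rep(seperator ~> singleValueTerm) ^^ { case head ~ tail => head :: tail }

  // Hypothetical helper: parses the whole input or throws if parsing fails.
  def values(p: Parser[List[String]], input: String): List[String] =
    parseAll(p, input).get
}
```

Both Rep1SepEquivalence.values(viaRep1sep, "111|222|333|444") and the same call with viaRep should return List("111", "222", "333", "444"); rep1sep simply saves writing the mapping by hand.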