I want to parse a line like this:
fieldName: value1|value2 anotherFieldName: value3 valueWithoutFieldName
into
List(Some("fieldName") ~ List("value1", "value2"), Some("anotherFieldName") ~ List("value3"), None~List("valueWithoutFieldName"))
Alternative field values are separated by a pipe (|). The field name is optional; if a field has no name, it should be parsed as None (see valueWithoutFieldName).
This is what I have so far:
val parser: Parser[ParsedQuery] = {
  phrase(rep(opt(fieldNameTerm) ~ (multipleValueTerm | singleValueTerm))) ^^ {
    case termSequence =>
      // operate on List[Option[String] ~ List[String]]
  }
}

val fieldNameTerm: Parser[String] = {
  ("\\w+".r <~ ":(?=\\S)".r) ^^ {
    case fieldName => fieldName
  }
}

val multipleValueTerm = rep1((singleValueTerm <~ alternativeValueTerm) | (alternativeValueTerm ~> singleValueTerm))

val alternativeValueTerm: Parser[String] = {
  // matches '|'
  ("\\|".r) ^^ {
    case token => token
  }
}

val singleValueTerm: Parser[String] = {
  // all non-whitespace characters except '|'
  ("[\\S&&[^\\|]]+".r) ^^ {
    case token => token
  }
}
Unfortunately, my code does not parse the last field value (the one after the final pipe) correctly and treats it as the value of a new nameless field. For instance, the following string:
"abc:111|222|333|444 cde:555"
is parsed into:
List((Some(abc)~List(111, 222, 333)), (None~444), (Some(cde)~555))
while I'd like it to be:
List((Some(abc)~List(111, 222, 333, 444)), (Some(cde)~555))
I think the problem lies in the definition of multipleValueTerm:
rep1((singleValueTerm <~ alternativeValueTerm) | (alternativeValueTerm ~> singleValueTerm))
Its second part is probably not interpreted correctly, but I have no idea why.
Shouldn't the <~ in the first part of multipleValueTerm leave the pipe that separates the values in the input, so that the second part of the expression (alternativeValueTerm ~> singleValueTerm) can parse it successfully?
Let's look at what is happening. We want to parse 111|222|333|444 with multipleValueTerm.
111| matches (singleValueTerm <~ alternativeValueTerm). The <~ combinator consumes the | but drops it from the result, so we keep 111.
That leaves 222|333|444.
In the same way, 222| and 333| are consumed, which leaves 444. But 444 matches neither (singleValueTerm <~ alternativeValueTerm), because no | follows it, nor (alternativeValueTerm ~> singleValueTerm), because it does not start with |. So rep1 stops without taking it, and 444 is then parsed as a new value without a field name.
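The trace above can be checked directly. The following is a minimal sketch, assuming the scala-parser-combinators library (scala.util.parsing.combinator.RegexParsers) is on the classpath. The PipeTrace object and the trace helper are hypothetical names introduced only for this illustration; the three parsers are copied from the question without the identity mappings.

```scala
import scala.util.parsing.combinator.RegexParsers

// Hypothetical wrapper object; the three parsers are the ones from the question.
object PipeTrace extends RegexParsers {
  val alternativeValueTerm: Parser[String] = "\\|".r
  val singleValueTerm: Parser[String] = "[\\S&&[^\\|]]+".r
  val multipleValueTerm: Parser[List[String]] =
    rep1((singleValueTerm <~ alternativeValueTerm) | (alternativeValueTerm ~> singleValueTerm))

  // Hypothetical helper: returns the values rep1 collected
  // together with the part of the input it left unconsumed.
  def trace(input: String): (List[String], String) =
    parse(multipleValueTerm, input) match {
      case Success(values, rest) =>
        val remaining = rest.source.subSequence(rest.offset, rest.source.length).toString
        (values, remaining)
      case other =>
        sys.error(other.toString)
    }
}
```

With this sketch, PipeTrace.trace("111|222|333|444") should return the values 111, 222 and 333 together with the unconsumed rest 444, which is exactly the piece that later ends up as a nameless field.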
I would improve your parser this way:
val seperator = "|"
lazy val parser: Parser[List[(Option[String] ~ List[String])]] = rep(termParser)
lazy val termParser: Parser[(Option[String] ~ List[String])] = opt(fieldNameTerm) ~ valueParser
lazy val fieldNameTerm: Parser[String] = ("\\w+".r <~ ":(?=\\S)".r)
lazy val valueParser: Parser[List[String]] = rep1sep(singleValueTerm, seperator)
lazy val singleValueTerm: Parser[String] = ("[\\S&&[^\\|]]+".r)
There is no need for the identity mappings ^^ { case x => x }, so I removed them. I also treat single and multiple values the same way: the result is always a List with one or more elements. rep1sep (like repsep) is convenient for dealing with separators.
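To try the improved parser end to end, here is a minimal runnable sketch, again assuming scala-parser-combinators is available. The QueryParser object and the parseQuery helper are hypothetical additions for illustration; the parser definitions are the ones given above.

```scala
import scala.util.parsing.combinator.RegexParsers

// Hypothetical wrapper object around the parsers from the answer.
object QueryParser extends RegexParsers {
  val seperator = "|"
  lazy val parser: Parser[List[Option[String] ~ List[String]]] = rep(termParser)
  lazy val termParser: Parser[Option[String] ~ List[String]] = opt(fieldNameTerm) ~ valueParser
  lazy val fieldNameTerm: Parser[String] = "\\w+".r <~ ":(?=\\S)".r
  lazy val valueParser: Parser[List[String]] = rep1sep(singleValueTerm, seperator)
  lazy val singleValueTerm: Parser[String] = "[\\S&&[^\\|]]+".r

  // Hypothetical helper: parses the whole input and flattens each
  // `name ~ values` pair into a plain tuple for easier inspection.
  def parseQuery(input: String): List[(Option[String], List[String])] =
    parseAll(parser, input) match {
      case Success(terms, _) => terms.map { case name ~ values => (name, values) }
      case failure           => sys.error(failure.toString)
    }
}
```

For the input "abc:111|222|333|444 cde:555", parseQuery should yield two entries: the field abc with the four values 111, 222, 333 and 444, and the field cde with the single value 555. A trailing token without a colon, as in "abc:1|2 plain", should come back with None as its field name.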
rep1sep(singleValueTerm, seperator)
could be equivalently expressed as
singleValueTerm ~ rep(seperator ~> singleValueTerm)
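One detail worth spelling out: the hand-written form produces a String ~ List[String] rather than a List[String], so it needs a small mapping to give the same result as rep1sep. A minimal sketch of both variants side by side, assuming scala-parser-combinators; the Rep1SepEquivalence object and the values helper are hypothetical names used only here.

```scala
import scala.util.parsing.combinator.RegexParsers

// Hypothetical object comparing the built-in combinator with the manual form.
object Rep1SepEquivalence extends RegexParsers {
  val seperator = "|"
  val singleValueTerm: Parser[String] = "[\\S&&[^\\|]]+".r

  // Built-in combinator: one or more values separated by the pipe.
  val viaRep1sep: Parser[List[String]] = rep1sep(singleValueTerm, seperator)

  // Manual form: the first value, then zero or more "| value" pairs,
  // with head and tail joined back into a single list.
  val viaRep: Parser[List[String]] =
    singleValueTerm ~ rep(seperator ~> singleValueTerm) ^^ { case head ~ tail => head :: tail }

  // Hypothetical helper: parses the whole input or throws on failure.
  def values(p: Parser[List[String]], input: String): List[String] =
    parseAll(p, input).get
}
```

Both parsers should return the same list for inputs such as "111|222|333|444" or a single value like "555", so the choice between them is mostly a matter of readability.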