Recently, I am learning the Scala parser combinator. I would like to parse the key
in a given string. For instance,
val expr1 = "local_province != $province_name$ or city=$city_name$ or people_number<>$some_digit$"
// ==>
List("local_province", "city", "people_number")
val expr2 = "(local_province=$province_name$)"
// ==>
List("local_province")
val expr3 = "(local_province=$province_name$ or city=$city_name$) and (lib_name=$akka$ or lib_author=$martin$)"
// ==>
List("local_province", "city", "lib_name", "lib_author")
import scala.util.parsing.combinator.JavaTokenParsers
class KeyParser extends JavaTokenParsers {
lazy val key = """[a-zA-Z_]+""".r
lazy val value = "$" ~ key ~ "$"
lazy val logicOps = ">" | "<" | "=" | ">=" | "<=" | "!=" | "<>"
lazy val elem: Parser[String] = key <~ (logicOps ~ value)
lazy val expr: Parser[List[String]] =
"(" ~> repsep(elem, "and" | "or") <~ ")" | repsep(elem, "and" | "or")
lazy val multiExpr: Parser[List[String]] =
repsep(expr, "and" | "or") ^^ { _.foldLeft(List.empty[String])(_ ++ _) }
}
object KeyParser extends KeyParser {
def parse(input: String) = parseAll(multiExpr, input)
}
Here is my test in Scala REPL
KeyParser.parse(expr1)
[1.72] failure:
$' expected but
>' foundKeyParser.parse(expr2)
[1.33] parsed: List(local_province)
KeyParser.parse(expr3)
[1.98] parsed: List(local_province, city, lib_name, lib_author)
I notice that the KeyParser
only works for "="
and it doesn't support the case like "(local_province<>$province_name$ AND city!=$city_name$)"
which contains "<> | !="
and "AND".
So I would like to know how to revise it.
I notice that the
KeyParser
only works for"="
This isn't quite true. It also works for !=
, <
and >
. The ones it doesn't work for are >=
, <=
and <>
.
More generally it does not work for those operators which have a prefix of them appear in the list of alternatives before them. That is >=
is not matched because >
appears before it and is a prefix of it.
So why does this happen? The |
operator creates a parser that produces the result of the left parser if it succeeds or of the right parser otherwise. So if you have a chain of |
s, you'll get the result of the first parser in that chain which can match the current input. So if the current input is <>$some_digit$
, the parser logicOps
will match <
and leave you with >$some_digit$
as the remaining input. So now it tries to match value
against that input and fails.
Why doesn't backtracking help here? Because the logicOps
parser already succeeded, so there's nowhere to backtrack to. If the parser were structured like this:
lazy val logicOpAndValue = ">" ~ value | "<" ~ value | "=" ~ value |
">=" ~ value | "<=" ~ value | "!=" ~ value |
"<>" ~ value
lazy val elem: Parser[String] = key <~ logicOpAndValue
Then the following would happen (with the current input being <>$some_digit$
):
">"
does not match the current input, so go to next alternative"<"
does match the current input, so try the right operand of the ~
(i.e. value
) with the current input >$some_digit$
. This fails, so continue with the next alternative."<>"
does match the current input, so try the right operand of the ~
. This matches as well. Success!However in your code the ~ value
is outside of the list of alternatives, not inside each alternative. So when the parser fails, we're no longer inside any alternative, so there's no next alternative to try and it just fails.
Of course moving the ~ value
inside the alternatives isn't really a satisfying solution as it's ugly as hell and not very maintainable in the general case.
One solution is simply to move the longer operators at the beginning of the alternatives (i.e. ">=" | "<=" | "<>" | ">" | "<" | ...
). This way ">"
and "<"
will only be tried if ">="
, "<="
and "<>"
have already failed.
A still better solution, which does not rely on the order of alternatives and is thus less error-prone, is to use |||
instead of. |||
works like |
except that it tries all of the alternatives and then returns the longest successful result - not the first.
PS: This isn't related to your problem but you're currently limiting the nesting depth of parentheses because your grammar is not recursive. To allow unlimited nesting, you'll want your expr
and multiExpr
rules to look like this:
lazy val expr: Parser[List[String]] =
"(" ~> multiExpr <~ ")" | elem
lazy val multiExpr: Parser[List[String]] =
repsep(expr, "and" | "or") ^^ { _.foldLeft(List.empty[String])(_ ++ _) }
However I recommend renaming expr
to something like primaryExpr
and multiExpr
to expr
.
_.foldLeft(List.empty[String])(_ ++ _)
can also be more succinctly expressed as _.flatten
.