Search code examples
scalaparsingparser-combinators

How to make my parser support logic operations and word case-insensitive?


Recently, I am learning the Scala parser combinator. I would like to parse the key in a given string. For instance,

val expr1 = "local_province != $province_name$ or city=$city_name$ or people_number<>$some_digit$"

// ==> List("local_province", "city", "people_number")

val expr2 = "(local_province=$province_name$)"

// ==> List("local_province")

val expr3 = "(local_province=$province_name$ or city=$city_name$) and (lib_name=$akka$ or lib_author=$martin$)"

// ==> List("local_province", "city", "lib_name", "lib_author")

Trial

import scala.util.parsing.combinator.JavaTokenParsers

class KeyParser extends JavaTokenParsers {
  lazy val key = """[a-zA-Z_]+""".r
  lazy val value = "$" ~ key ~ "$"
  lazy val logicOps = ">" | "<" | "=" | ">=" | "<=" | "!=" | "<>"
  lazy val elem: Parser[String] = key <~ (logicOps  ~ value)
  lazy val expr: Parser[List[String]] =
    "(" ~> repsep(elem, "and" | "or") <~ ")" | repsep(elem, "and" | "or")

  lazy val multiExpr: Parser[List[String]] =
    repsep(expr, "and" | "or") ^^ { _.foldLeft(List.empty[String])(_ ++ _) }
}

object KeyParser extends KeyParser {
  def parse(input: String) = parseAll(multiExpr, input)
}

Here is my test in Scala REPL

KeyParser.parse(expr1)

[1.72] failure: $' expected but >' found

KeyParser.parse(expr2)

[1.33] parsed: List(local_province)

KeyParser.parse(expr3)

[1.98] parsed: List(local_province, city, lib_name, lib_author)

I notice that the KeyParser only works for "=" and it doesn't support the case like "(local_province<>$province_name$ AND city!=$city_name$)" which contains "<> | !=" and "AND".

So I would like to know how to revise it.


Solution

  • I notice that the KeyParser only works for "="

    This isn't quite true. It also works for !=, < and >. The ones it doesn't work for are >=, <= and <>.

    More generally it does not work for those operators which have a prefix of them appear in the list of alternatives before them. That is >= is not matched because > appears before it and is a prefix of it.

    So why does this happen? The | operator creates a parser that produces the result of the left parser if it succeeds or of the right parser otherwise. So if you have a chain of |s, you'll get the result of the first parser in that chain which can match the current input. So if the current input is <>$some_digit$, the parser logicOps will match < and leave you with >$some_digit$ as the remaining input. So now it tries to match value against that input and fails.

    Why doesn't backtracking help here? Because the logicOps parser already succeeded, so there's nowhere to backtrack to. If the parser were structured like this:

    lazy val logicOpAndValue = ">" ~ value | "<" ~ value | "=" ~ value |
                               ">=" ~ value | "<=" ~ value | "!=" ~ value |
                               "<>" ~ value
    lazy val elem: Parser[String] = key <~ logicOpAndValue
    

    Then the following would happen (with the current input being <>$some_digit$):

    • ">" does not match the current input, so go to next alternative
    • "<" does match the current input, so try the right operand of the ~ (i.e. value) with the current input >$some_digit$. This fails, so continue with the next alternative.
    • ... bunch of alternatives that don't match ...
    • "<>" does match the current input, so try the right operand of the ~. This matches as well. Success!

    However in your code the ~ value is outside of the list of alternatives, not inside each alternative. So when the parser fails, we're no longer inside any alternative, so there's no next alternative to try and it just fails.

    Of course moving the ~ value inside the alternatives isn't really a satisfying solution as it's ugly as hell and not very maintainable in the general case.

    One solution is simply to move the longer operators at the beginning of the alternatives (i.e. ">=" | "<=" | "<>" | ">" | "<" | ...). This way ">" and "<" will only be tried if ">=", "<=" and "<>" have already failed.

    A still better solution, which does not rely on the order of alternatives and is thus less error-prone, is to use ||| instead of. ||| works like | except that it tries all of the alternatives and then returns the longest successful result - not the first.


    PS: This isn't related to your problem but you're currently limiting the nesting depth of parentheses because your grammar is not recursive. To allow unlimited nesting, you'll want your expr and multiExpr rules to look like this:

    lazy val expr: Parser[List[String]] =
      "(" ~> multiExpr <~ ")" | elem
    lazy val multiExpr: Parser[List[String]] =
      repsep(expr, "and" | "or") ^^ { _.foldLeft(List.empty[String])(_ ++ _) }
    

    However I recommend renaming expr to something like primaryExpr and multiExpr to expr.

    _.foldLeft(List.empty[String])(_ ++ _) can also be more succinctly expressed as _.flatten.