Ignoring prefixes in a JavaToken combinator parser


I'm trying to use a JavaTokenParsers combinator parser to pull out a particular match from the middle of a larger string (i.e. ignore an arbitrary run of prefix characters). However, I can't get it working, and I suspect I'm being caught out by a greedy parser and/or CRs and LFs (the prefix characters can be basically anything). I have:

class RuleHandler extends JavaTokenParsers {

  def allowedPrefixChars = """[a-zA-Z0-9=*+-/<>!\_(){}~\\s]*""".r

  def findX: Parser[Double] = allowedPrefixChars ~ "(x=" ~> floatingPointNumber <~ ")" ^^ { case num => num.toDouble}

}

and then in my test case:

"when looking for the X value" in {
  "must find and correctly interpret X" in {
    val testString =
      """
        |Looking (only)
        |for (x=45) within
        |this string
      """.stripMargin
    val answer = ruleHandler.parse(ruleHandler.findX, testString)
    System.out.println(" X value is : " + answer.toString)
  }
}

I think it's similar to this SO question. Can anyone see what's wrong? Thanks.


Solution

  • First, you should not double the backslash in "\s" inside a triple-quoted string (""" """):

    def allowedPrefixChars = """[a-zA-Z0-9=*+-/<>!\_(){}~\s]*?""".r
    

    In your case "\\s" inside the character class was read as two separate members, a literal "\" and the letter "s", rather than as the whitespace class \s.
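
    The difference can be checked with the standard library alone. The following is a minimal sketch; the object and value names (EscapeDemo, doubled, single) are made up for illustration and are not part of the original code.

```scala
object EscapeDemo {
  // In a triple-quoted string, backslash sequences such as \s and \\ reach
  // the regex engine unchanged.
  val doubled = """[\\s]""".r // class with two members: a literal backslash and the letter 's'
  val single  = """[\s]""".r  // class with one member: the whitespace shorthand \s

  def main(args: Array[String]): Unit = {
    println(doubled.findFirstIn(" ").isDefined) // false: a space is not in the class
    println(doubled.findFirstIn("s").isDefined) // true: the letter 's' is
    println(single.findFirstIn(" ").isDefined)  // true: a space is whitespace
    println(single.findFirstIn("s").isDefined)  // false: 's' is not whitespace
  }
}
```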

    Second, your allowedPrefixChars pattern itself allows (, x and =, so it greedily consumes the whole string, including (x=, and nothing is left for the subsequent parsers. A regex parser does not give characters back once it has matched.
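
    To see how much the prefix pattern swallows, it can be applied on its own as a prefix match, which is roughly what a regex parser does at the current input position. This is a sketch using only scala.util.matching.Regex; GreedyPrefixDemo, remainder and the simplified testString are illustrative assumptions.

```scala
object GreedyPrefixDemo {
  // The prefix pattern with the single-backslash \s fix applied.
  val allowedPrefixChars = """[a-zA-Z0-9=*+-/<>!\_(){}~\s]*""".r

  // Simplified version of the test input from the question.
  val testString = "\nLooking (only)\nfor (x=45) within\nthis string\n"

  // Text that is left after the pattern has matched at the start of the input.
  def remainder(input: String): String =
    allowedPrefixChars.findPrefixOf(input)
      .map(matched => input.drop(matched.length))
      .getOrElse(input)

  def main(args: Array[String]): Unit =
    println(s"left over: '${remainder(testString)}'") // prints: left over: ''
}
```

    Every character of the input belongs to the class, so the remainder is empty and a following "(x=" parser has nothing to match.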

    The solution is to be more specific about the prefix you want:

    object ruleHandler extends JavaTokenParsers {
    
      def allowedPrefixChar: Parser[String] = """[a-zA-Z0-9=*+-/<>!\_){}~\s]""".r //no "(" here
    
      def findX: Parser[Double] = rep(allowedPrefixChar | "\\((?!x=)".r ) ~ "(x=" ~> floatingPointNumber <~ ")" ^^ { case num => num.toDouble}
    }
    
    ruleHandler.parse(ruleHandler.findX, testString)
    res14: ruleHandler.ParseResult[Double] = [3.11] parsed: 45.0
    

    The prefix now accepts a ( only when it is not followed by x= (that is just a negative lookahead), so the ( that starts (x= is left for the next parser.
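
    The lookahead can be tried in isolation with a plain Regex. A small sketch, where LookaheadDemo and otherParen are hypothetical names:

```scala
object LookaheadDemo {
  // Matches "(" only when the next two characters are not "x=".
  val otherParen = """\((?!x=)""".r

  def main(args: Array[String]): Unit = {
    println(otherParen.findPrefixOf("(only)").isDefined) // true: "(" is followed by "on"
    println(otherParen.findPrefixOf("(x=45)").isDefined) // false: the lookahead rejects "x="
  }
}
```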

    Alternative:

    """\(x=(.*?)\)""".r.findAllMatchIn(testString).map(_.group(1).toDouble).toList
    res22: List[Double] = List(45.0)
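
    Note that (.*?) accepts any text between (x= and ), so toDouble would throw on something like (x=abc). A possible variant, assuming the values are plain decimal numbers, restricts the capture group to digits; RegexExtractDemo, xValue and findAllX are illustrative names.

```scala
object RegexExtractDemo {
  // Assumed number format: optional minus sign, digits, optional fraction.
  private val xValue = """\(x=(-?\d+(?:\.\d+)?)\)""".r

  // Returns every x value found in the text, in order of appearance.
  def findAllX(text: String): List[Double] =
    xValue.findAllMatchIn(text).map(_.group(1).toDouble).toList

  def main(args: Array[String]): Unit =
    println(findAllX("for (x=45) within (x=abc) and (x=-1.5)")) // List(45.0, -1.5)
}
```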
    

    If you want to use parsers properly, I would recommend describing the whole BNF grammar (with every possible use of (, ) and =), not just a fragment. For example, include (only) in your grammar if it is a keyword, and use something like "(" ~ valueName ~ "=" ~> value <~ ")" to get the value. Don't forget that scala-parser-combinators is intended to return an AST, not just some matched value. Plain regexes are better for regular matching on unstructured data.

    An example of how the parsers might be used in the intended way (I didn't try to compile it):

    import scala.util.parsing.combinator.JavaTokenParsers
    
    trait Command
    case class Rule(name: String, value: Double) extends Command
    case class Directive(name: String) extends Command
    
    class RuleHandler extends JavaTokenParsers { // why `JavaTokenParsers` (not `RegexParsers`) if you don't use tokens from the Java Language Specification?
    
      // Still not ideal: prefer the predefined Java-like literals that JavaTokenParsers provides.
      def string: Parser[String] = """[a-zA-Z0-9*+-/<>!\_{}~\s]*""".r
      // `~` keeps both results; `~>` binds tighter than `<~`, so chaining `<~ "=" ~>` would drop the second string.
      def rule: Parser[Rule] = "(" ~> string ~ ("=" ~> string) <~ ")" ^^ { case name ~ num => Rule(name, num.toDouble) }
      def directive: Parser[Directive] = "(" ~> string <~ ")" ^^ { case name => Directive(name) }
      // repsep yields a list of commands, with free text between them as the separator.
      def commands: Parser[List[Command]] = repsep(rule | directive, string)
    
    }
    

    If you need to process natural language (Chomsky type-0), scalanlp or something similar fits better.