
Why does scala.util.matching.Regex 'apparently' fail in Scala extractors?


I'm using Scala extractors (i.e. a Regex inside a pattern match) to identify doubles and longs, as shown below.

My question is: why does the Regex apparently fail when used in a pattern match, while it clearly delivers the expected results when used in a chain of if/then/else expressions?

val LONG   = """^(0|-?[1-9][0-9]*)$"""
val DOUBLE = """NaN|^-?(0(\.[0-9]*)?|([1-9][0-9]*\.[0-9]*)|(\.[0-9]+))([Ee][+-]?[0-9]+)?$"""

val scalaLONG   : scala.util.matching.Regex = LONG.r
val scalaDOUBLE : scala.util.matching.Regex = DOUBLE.r

val types1 = Seq("abc", "3", "3.0", "-3.0E-05", "NaN").map(text =>
    text match {
      case scalaLONG(long)     => s"Long"
      case scalaDOUBLE(double) => s"Double"
      case _                   => s"String"
    })
// Results types1: Seq[String] = List("String", "Long", "String", "String", "String")

val types2 = Seq("abc", "3", "3.0", "-3.0E-05", "NaN").map(text =>
    if(scalaDOUBLE.findFirstIn(text).isDefined) "Double" else
    if(scalaLONG  .findFirstIn(text).isDefined) "Long"   else    
    "String")
// Results types2: Seq[String] = List("String", "Long", "Double", "Double", "Double")

As you can see above, types2 delivers the expected results, whereas types1 reports "String" where "Double" is expected, which apparently points to a failure in the Regex processing.

EDIT: With help from @alex-savitsky and @leo-c, I arrived at the version shown below, which works as expected. However, I have to remember to provide an empty argument list in the pattern match, otherwise it gives wrong results. This looks error-prone to me.

val LONG   = """^(?:0|-?[1-9][0-9]*)$"""
val DOUBLE = """^NaN|-?(?:0(?:\.[0-9]*)?|(?:[1-9][0-9]*\.[0-9]*)|(?:\.[0-9]+))(?:[Ee][+-]?[0-9]+)?$"""

val scalaLONG   : scala.util.matching.Regex = LONG.r
val scalaDOUBLE : scala.util.matching.Regex = DOUBLE.r

val types1 = Seq("abc", "3", "3.0", "-3.0E-05", "NaN").map(text =>
    text match {
      case scalaLONG()     => s"Long"
      case scalaDOUBLE()   => s"Double"
      case _               => s"String"
    })
// Results types1: Seq[String] = List("String", "Long", "Double", "Double", "Double")

val types2 = Seq("abc", "3", "3.0", "-3.0E-05", "NaN").map(text =>
    if(scalaDOUBLE.findFirstIn(text).isDefined) "Double" else
    if(scalaLONG  .findFirstIn(text).isDefined) "Long"   else    
    "String")
// Results types2: Seq[String] = List("String", "Long", "Double", "Double", "Double")

EDIT: OK... although error-prone, it is an extractor pattern, which uses unapply (unapplySeq, in the case of Regex) behind the scenes, and in this case the argument list in the pattern has to fit what unapplySeq returns. @alex-savitsky uses _* in his edit, which explicitly states the intention of dropping all capture groups. Looks good to me.
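
To make the role of the extractor concrete, here is a minimal sketch of what unapplySeq returns for the capturing and non-capturing variants of DOUBLE. The object name GroupArity and its value names are hypothetical and exist only for illustration; the two regexes are copied from the question.

```scala
import scala.util.matching.Regex

object GroupArity {
  // Original pattern from the question: five capturing groups.
  val capturing: Regex =
    """NaN|^-?(0(\.[0-9]*)?|([1-9][0-9]*\.[0-9]*)|(\.[0-9]+))([Ee][+-]?[0-9]+)?$""".r

  // Same pattern with every group made non-capturing: zero capturing groups.
  val nonCapturing: Regex =
    """NaN|^-?(?:0(?:\.[0-9]*)?|(?:[1-9][0-9]*\.[0-9]*)|(?:\.[0-9]+))(?:[Ee][+-]?[0-9]+)?$""".r

  // unapplySeq yields one list element per capturing group
  // (null for groups that did not take part in the match).
  val groupsCapturing: Option[List[String]]    = capturing.unapplySeq("3.0")
  val groupsNonCapturing: Option[List[String]] = nonCapturing.unapplySeq("3.0")

  // A pattern with one argument only fits a one-element list,
  // so it cannot match the five-element result above.
  val oneVariable: Boolean = "3.0" match {
    case capturing(_) => true
    case _            => false
  }

  // `_*` accepts a list of any length.
  val anyArity: Boolean = "3.0" match {
    case capturing(_*) => true
    case _             => false
  }

  def main(args: Array[String]): Unit = {
    println(groupsCapturing.map(_.length))    // number of groups, capturing variant
    println(groupsNonCapturing.map(_.length)) // number of groups, non-capturing variant
    println(oneVariable)
    println(anyArity)
  }
}
```

In this sketch the regex itself matches "3.0" in both variants; only the number of arguments in the case pattern decides whether the extractor succeeds.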


Solution

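  • A pattern match requires the regex to match the whole input, while findFirstIn also accepts a match anywhere inside the input, so it can report more matches. findFirstIn still honors ^ and $, but | binds more loosely than the anchors: in NaN|^...$ they belong only to the second alternative, so the NaN branch is unanchored.

    Moving ^ to the beginning of the regex, as in val DOUBLE = """^NaN|-?(0(\.[0-9]*)?|([1-9][0-9]*\.[0-9]*)|(\.[0-9]+))([Ee][+-]?[0-9]+)?$""", anchors the NaN branch at the start, but it does not by itself make types1 work: case scalaDOUBLE(double) binds a single variable, and a Regex extractor only matches when the pattern has as many arguments as the regex has capturing groups (five in DOUBLE, one in LONG, which is why the Long case already worked).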

    EDIT: Here's my test case for your question

    object Test extends App {
        val regex = """^NaN|-?(?:0(?:\.[0-9]*)?|(?:[1-9][0-9]*\.[0-9]*)|(?:\.[0-9]+))(?:[Ee][+-]?[0-9]+)?$""".r
        println(Seq("abc", "3", "3.0", "-3.0E-05", "NaN").map {
            case regex() => "Double"
            case _ => "String"
        })
    }
    

    results in List(String, String, Double, Double, Double)

    As you can see, the non-capturing groups make all the difference: with no capturing groups left, the empty argument list in case regex() fits the match.

    If you still want to use capturing groups, you can use _* to ignore the capture result:

    object Test extends App {
        val regex = """^NaN|-?(0(\.[0-9]*)?|([1-9][0-9]*\.[0-9]*)|(\.[0-9]+))([Ee][+-]?[0-9]+)?$""".r
        println(Seq("abc", "3", "3.0", "-3.0E-05", "NaN").map {
            case regex(_*) => "Double"
            case _ => "String"
        })
    }
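
Putting the pieces together, here is a minimal sketch of a classifier, assuming the goal is to classify whole strings. The names NumberKind, LongPattern, DoublePattern, classify and classifyWithFind are hypothetical. The sub-patterns come from the question, but DoublePattern wraps both alternatives in one outer non-capturing group so that ^ and $ apply to the NaN branch and the numeric branch alike.

```scala
import scala.util.matching.Regex

object NumberKind {
  // Hypothetical names; the sub-patterns are taken from the question.
  val LongPattern: Regex = """^(?:0|-?[1-9][0-9]*)$""".r

  // One outer non-capturing group, so ^ and $ cover both NaN and the numeric form.
  val DoublePattern: Regex =
    """^(?:NaN|-?(?:0(?:\.[0-9]*)?|[1-9][0-9]*\.[0-9]*|\.[0-9]+)(?:[Ee][+-]?[0-9]+)?)$""".r

  // Pattern matching: `_*` ignores however many groups the regex has.
  def classify(text: String): String = text match {
    case LongPattern(_*)   => "Long"
    case DoublePattern(_*) => "Double"
    case _                 => "String"
  }

  // Same decision via findFirstIn; it agrees with classify on the sample
  // inputs below because both patterns are anchored at both ends.
  def classifyWithFind(text: String): String =
    if (LongPattern.findFirstIn(text).isDefined) "Long"
    else if (DoublePattern.findFirstIn(text).isDefined) "Double"
    else "String"

  def main(args: Array[String]): Unit = {
    val inputs = Seq("abc", "3", "3.0", "-3.0E-05", "NaN", "xNaNx")
    println(inputs.map(classify))
    println(inputs.map(classifyWithFind))
  }
}
```

The extra input "xNaNx" is there to show the effect of the outer group: with the NaN branch anchored, neither approach treats it as a Double.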