java regex scala pattern-matching data-extraction

Why does scala.util.matching.Regex 'apparently' fail in Scala extractors?

I'm using Scala extractors (i.e. a Regex inside a pattern match) to identify doubles and longs, as shown below.

My question is: why does Regex apparently fail when used in a pattern match, while it clearly delivers the expected results when used in a chain of if/then/else expressions?

val LONG   = """^(0|-?[1-9][0-9]*)$"""
val DOUBLE = """NaN|^-?(0(\.[0-9]*)?|([1-9][0-9]*\.[0-9]*)|(\.[0-9]+))([Ee][+-]?[0-9]+)?$"""

val scalaLONG   : scala.util.matching.Regex = LONG.r
val scalaDOUBLE : scala.util.matching.Regex = DOUBLE.r

val types1 = Seq("abc", "3", "3.0", "-3.0E-05", "NaN").map(text =>
    text match {
      case scalaLONG(long)     => s"Long"
      case scalaDOUBLE(double) => s"Double"
      case _                   => s"String"
    })
// Results types1: Seq[String] = List("String", "Long", "String", "String", "String")

val types2 = Seq("abc", "3", "3.0", "-3.0E-05", "NaN").map(text =>
    if(scalaDOUBLE.findFirstIn(text).isDefined) "Double" else
    if(scalaLONG  .findFirstIn(text).isDefined) "Long"   else    
    "String")
// Results types2: Seq[String] = List("String", "Long", "Double", "Double", "Double")

As you can see above, types2 delivers the expected results, whereas types1 reports "String" where "Double" is expected, apparently pointing to a failure in the Regex processing.

EDIT: With help from @alex-savitsky and @leo-c, I've arrived at the following, which works as expected. However, I have to remember to provide an empty argument list in the pattern match, otherwise it gives wrong results. This looks error-prone to me.

val LONG   = """^(?:0|-?[1-9][0-9]*)$"""
val DOUBLE = """^NaN|-?(?:0(?:\.[0-9]*)?|(?:[1-9][0-9]*\.[0-9]*)|(?:\.[0-9]+))(?:[Ee][+-]?[0-9]+)?$"""

val scalaLONG   : scala.util.matching.Regex = LONG.r
val scalaDOUBLE : scala.util.matching.Regex = DOUBLE.r

val types1 = Seq("abc", "3", "3.0", "-3.0E-05", "NaN").map(text =>
    text match {
      case scalaLONG()     => s"Long"
      case scalaDOUBLE()   => s"Double"
      case _               => s"String"
    })
// Results types1: Seq[String] = List("String", "Long", "Double", "Double", "Double")

val types2 = Seq("abc", "3", "3.0", "-3.0E-05", "NaN").map(text =>
    if(scalaDOUBLE.findFirstIn(text).isDefined) "Double" else
    if(scalaLONG  .findFirstIn(text).isDefined) "Long"   else    
    "String")
// Results types2: Seq[String] = List("String", "Long", "Double", "Double", "Double")

EDIT: OK... error-prone or not, it is an extractor pattern, which calls unapplySeq behind the scenes, and in this case the arguments in the pattern have to line up with the capture groups it returns. @alex-savitsky is using _* in his edit, which makes the intention of dropping all capture groups explicit. Looks good to me.

Solution

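match matches the whole input, while findFirstIn searches for a match anywhere in the input, so it can succeed on partial contents and sometimes produces more matches. findFirstIn does not ignore ^ and $, but in NaN|^-?...$ the alternation binds more loosely than the anchors, so the NaN branch is not anchored at all.

If your intention was to match the whole input, put your ^ at the beginning of the regex, as in val DOUBLE = """^NaN|-?(0(\.[0-9]*)?|([1-9][0-9]*\.[0-9]*)|(\.[0-9]+))([Ee][+-]?[0-9]+)?$""". Strictly, ^ then anchors only the NaN branch and $ only the numeric branch; wrapping the alternation as ^(?:NaN|...)$ anchors both. The anchors alone do not make types1 report the right types, though: case scalaDOUBLE(double) supplies one argument, while this regex has five capturing groups, so the pattern never matches.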
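
To make the difference between the two matching modes concrete, here is a minimal sketch. The regexes (digits, anchoredDigits, digitsAnywhere) and the sample string are made up for illustration; only findFirstIn, unanchored and the extractor syntax come from scala.util.matching.Regex.

```scala
import scala.util.matching.Regex

// Hypothetical regexes, used only for this illustration.
val digits: Regex = """[0-9]+""".r
val anchoredDigits: Regex = """^[0-9]+$""".r

// findFirstIn looks for a match anywhere in the input...
val partial = digits.findFirstIn("abc123")          // Some("123")
// ...but ^ and $ still have to hold where they appear.
val anchored = anchoredDigits.findFirstIn("abc123") // None

// A Regex used as an extractor must match the whole input.
val wholeOnly = "abc123" match {
  case digits() => true
  case _        => false
}

// unanchored relaxes the extractor to find-style matching.
val digitsAnywhere = digits.unanchored
val relaxed = "abc123" match {
  case digitsAnywhere() => true
  case _                => false
}
```

Here wholeOnly is false and relaxed is true, which is the same asymmetry seen between types1 and types2 whenever the input contains more than the number itself.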

EDIT: Here's my test case for your question

object Test extends App {
    val regex = """^NaN|-?(?:0(?:\.[0-9]*)?|(?:[1-9][0-9]*\.[0-9]*)|(?:\.[0-9]+))(?:[Ee][+-]?[0-9]+)?$""".r
    println(Seq("abc", "3", "3.0", "-3.0E-05", "NaN").map {
        case regex() => "Double"
        case _ => "String"
    })
}

results in List(String, String, Double, Double, Double)

As you see, the non-capturing groups make all the difference: case regex() takes no arguments, so it can only match a regex that has no capturing groups.
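
The effect of the capturing groups can be checked by calling unapplySeq directly, which is the method a Regex pattern relies on. On a successful match it returns one list element per capturing group, and a pattern only matches if its argument count fits that list. The sketch below reuses the two regexes from this answer; the value names are illustrative.

```scala
val capturing    = """^NaN|-?(0(\.[0-9]*)?|([1-9][0-9]*\.[0-9]*)|(\.[0-9]+))([Ee][+-]?[0-9]+)?$""".r
val nonCapturing = """^NaN|-?(?:0(?:\.[0-9]*)?|(?:[1-9][0-9]*\.[0-9]*)|(?:\.[0-9]+))(?:[Ee][+-]?[0-9]+)?$""".r

// One list element per capturing group; groups that took no part in the match are null.
val withGroups    = capturing.unapplySeq("3.0")    // a list of five elements
val withoutGroups = nonCapturing.unapplySeq("3.0") // Some(Nil)

// An empty argument list asks for an empty list of groups,
// so only the regex without capturing groups can satisfy it.
val emptyArgsOnCapturing = "3.0" match {
  case capturing() => true
  case _           => false
}
val emptyArgsOnNonCapturing = "3.0" match {
  case nonCapturing() => true
  case _              => false
}
```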

If you still want to use capturing groups, you can use _* to ignore the capture result:

object Test extends App {
    val regex = """^NaN|-?(0(\.[0-9]*)?|([1-9][0-9]*\.[0-9]*)|(\.[0-9]+))([Ee][+-]?[0-9]+)?$""".r
    println(Seq("abc", "3", "3.0", "-3.0E-05", "NaN").map {
        case regex(_*) => "Double"
        case _ => "String"
    })
}
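
If only a yes/no answer is needed, a possible alternative that does not depend on the number of capturing groups is to ask the underlying java.util.regex.Pattern for a whole-input match. This is a sketch rather than part of the approach above: the helper names matchesFully and classify are hypothetical, and the regexes are the original ones from the question.

```scala
import scala.util.matching.Regex

val LONG: Regex   = """^(0|-?[1-9][0-9]*)$""".r
val DOUBLE: Regex = """NaN|^-?(0(\.[0-9]*)?|([1-9][0-9]*\.[0-9]*)|(\.[0-9]+))([Ee][+-]?[0-9]+)?$""".r

// Hypothetical helper: true when the regex matches the entire text,
// regardless of how many capturing groups it declares.
def matchesFully(regex: Regex, text: String): Boolean =
  regex.pattern.matcher(text).matches()

// Hypothetical classifier mirroring types1 from the question.
def classify(text: String): String =
  if (matchesFully(LONG, text)) "Long"
  else if (matchesFully(DOUBLE, text)) "Double"
  else "String"

val types = Seq("abc", "3", "3.0", "-3.0E-05", "NaN").map(classify)
// List(String, Long, Double, Double, Double)
```

The trade-off is that no groups are bound, so this only suits cases where the captured text is not needed.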