scala, apache-spark, struct, pattern-matching, case-class

Scala - Spark SQL Row pattern matching on struct


I'm trying to do pattern matching inside a DataFrame map function - matching each Row against a Row pattern that contains a nested case class. The DataFrame is the result of a join and has the schema shown below: a few columns of primitive types plus two struct columns:

case class MyList(values: Seq[Integer])
case class MyItem(key1: String, key2: String, field1: Integer, group1: MyList, group2: MyList, field2: Integer)
val myLine1 = MyItem("MyKey01", "MyKey02", 1, MyList(Seq(1)), MyList(Seq(2)), 2)
val myLine2 = MyItem("YourKey01", "YourKey02", 2, MyList(Seq(2, 3)), MyList(Seq(4, 5)), 20)
val dfRaw = Seq(myLine1, myLine2).toDF
dfRaw.printSchema
dfRaw.show
val df2 = dfRaw.map(r => r match {
    case Row(key1: String, key2: String, field1: Integer, group1: MyList, group2: MyList, field2: Integer) => "Matched"
    case _ => "Un matched"
})
df2.show

My problem is that after the map, all I get is "Un matched":

root
 |-- key1: string (nullable = true)
 |-- key2: string (nullable = true)
 |-- field1: integer (nullable = true)
 |-- group1: struct (nullable = true)
 |    |-- values: array (nullable = true)
 |    |    |-- element: integer (containsNull = true)
 |-- group2: struct (nullable = true)
 |    |-- values: array (nullable = true)
 |    |    |-- element: integer (containsNull = true)
 |-- field2: integer (nullable = true)
+---------+---------+------+--------------------+--------------------+------+
|     key1|     key2|field1|              group1|              group2|field2|
+---------+---------+------+--------------------+--------------------+------+
|  MyKey01|  MyKey02|     1|   [WrappedArray(1)]|   [WrappedArray(2)]|     2|
|YourKey01|YourKey02|     2|[WrappedArray(2, 3)]|[WrappedArray(4, 5)]|    20|
+---------+---------+------+--------------------+--------------------+------+
df2: org.apache.spark.sql.Dataset[String] = [value: string]
+----------+
|     value|
+----------+
|Un matched|
|Un matched|
+----------+

If I ignore those two struct columns in the case branch (replacing group1: MyList, group2: MyList with _, _), then it works:

case Row(key1: String, key2: String, field1: Integer, _, _, field2: Integer) => "Matched"

Could you please help with how to pattern match on that case class? Thanks!


Solution

  • Struct columns are represented as org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema at runtime in Spark,

    so you will have to define the match case as:

    import org.apache.spark.sql.catalyst.expressions._
    val df2 = dfRaw.map(r => r match {
        case Row(key1: String, key2: String, field1: Integer, group1: GenericRowWithSchema, group2: GenericRowWithSchema, field2: Integer) => "Matched"
        case _ => "Un matched"
    })
    
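    If you need the struct's contents and not just a successful match, the nested Row can be destructured in the same pattern. A minimal sketch, assuming the schema shown above; the nested Row(...) patterns and the names values1/values2 are illustrative, not from the original question:

    val df3 = dfRaw.map {
        // Each struct column arrives as a Row, so a nested Row pattern
        // extracts its single `values` field (a Seq at runtime).
        case Row(key1: String, _, _, Row(values1: Seq[_]), Row(values2: Seq[_]), _) =>
            s"$key1: group1=${values1.mkString(",")}, group2=${values2.mkString(",")}"
        case _ => "Un matched"
    }
    df3.show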

    Matching those columns with a wildcard (_) works because a wildcard pattern matches any value, including a GenericRowWithSchema, without performing a type check.

    Defining the case as below, with untyped bindings for group1 and group2, should work too, since a variable pattern without a type annotation likewise matches any value:

    case Row(key1: String, key2: String, field1: Integer, group1, group2, field2: Integer) => "Matched"
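
    As a side note beyond the original answer: if the goal is to match on the case class itself, an arguably simpler route is to decode the rows back into MyItem with as[...] and skip Row matching altogether. A sketch, assuming the implicit product encoder for MyItem (already exercised by toDF above):

    // Decode each Row into MyItem via the implicit Encoder,
    // then pattern match on the case classes directly.
    val typed = dfRaw.as[MyItem]
    val df4 = typed.map {
        case MyItem(key1, _, _, MyList(vs), _, _) => s"$key1 -> ${vs.mkString(",")}"
    }
    df4.show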