Tags: scala, apache-spark, encoding, localdate

DataSet encoding of map and sequence of LocalDate problems


I'm writing a solution in Scala on Spark 3.5.0 that has complex map/seq data structures of dates, where I need to represent missing dates with null or None. Since this is Scala, my preference is None; however, I get an exception when trying to encode Option[LocalDate] inside a sequence or map. As a workaround I switched to null, but that doesn't work for maps because null can't be used as a key.

This is code that reproduces the problem:

import java.time.LocalDate

case class SubClass[T](i: Option[T])

case class TestClass[T](i: Option[T] = None, st: Option[SubClass[T]], sq: Option[Seq[Option[T]]] = None)

  it should "handle sequence of date" in {
    val sparkSession = spark
    import sparkSession.implicits._
    val aDate = LocalDate.parse("2024-04-29")
    
    val ds = Seq(TestClass[LocalDate](Some(aDate), Some(SubClass[LocalDate](Some(aDate))), Some(Seq(Some(aDate))))).toDS
    
    ds.show
  }

  it should "handle date structure" in {
    val sparkSession = spark
    import sparkSession.implicits._
    val aDate = LocalDate.parse("2024-04-29")
    
    val ds = Seq(TestClass[LocalDate](Some(aDate), Some(SubClass[LocalDate](Some(aDate))))).toDS
    
    ds.show
  }


  it should "handle sequence of double" in {
    val sparkSession = spark
    import sparkSession.implicits._
    val aNum = 2.42
    
    val ds = Seq(TestClass[Double](Some(aNum), Some(SubClass[Double](Some(aNum))), Some(Seq(Some(aNum))))).toDS
    
    ds.show
  }

  it should "handle sequence of string" in {
    val sparkSession = spark
    import sparkSession.implicits._
    val aString = "Foo"
    
    val ds = Seq(TestClass[String](Some(aString), Some(SubClass[String](Some(aString))), Some(Seq(Some(aString))))).toDS
    
    ds.show
  }

All of the above tests pass except "handle sequence of date", which fails with:

org.apache.spark.SparkRuntimeException: Error while encoding: java.lang.RuntimeException: scala.Some is not a valid external type for schema of date

This only seems to be an issue with LocalDate, as the Double and String tests work just fine.

Can anyone suggest a workaround or fix for this?

Thanks,

David


Solution

  • I assume this is Spark 3.5.0; on that version I can reproduce the error, but on 3.5.1 it no longer occurs.

    This is probably triggered by https://issues.apache.org/jira/browse/SPARK-45896

    (The fix is present on Databricks 14.3 LTS but not on lower versions.)
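
    If upgrading to 3.5.1 isn't an option, one possible workaround on 3.5.0 is to keep the dates as ISO-8601 strings inside the Dataset (the String test above encodes fine) and convert to and from LocalDate at the boundary. This is only a minimal sketch under that assumption; the TestClassStr/SubClassStr classes and the toStrings/fromStrings helpers are hypothetical names, not part of any Spark API:

    import java.time.LocalDate

    // Hypothetical mirror of the question's classes that stores dates as
    // ISO-8601 strings, which Spark 3.5.0 encodes without the failure above.
    case class SubClassStr(i: Option[String])
    case class TestClassStr(
      i: Option[String] = None,
      st: Option[SubClassStr] = None,
      sq: Option[Seq[Option[String]]] = None)

    // Convert to the string-based shape before calling .toDS ...
    def toStrings(t: TestClass[LocalDate]): TestClassStr =
      TestClassStr(
        i  = t.i.map(_.toString),  // LocalDate.toString is ISO-8601
        st = t.st.map(s => SubClassStr(s.i.map(_.toString))),
        sq = t.sq.map(_.map(_.map(_.toString))))

    // ... and parse back after collecting.
    def fromStrings(t: TestClassStr): TestClass[LocalDate] =
      TestClass[LocalDate](
        i  = t.i.map(s => LocalDate.parse(s)),
        st = t.st.map(s => SubClass[LocalDate](s.i.map(d => LocalDate.parse(d)))),
        sq = t.sq.map(_.map(_.map(s => LocalDate.parse(s)))))

    // Usage in the failing test:
    // val ds = Seq(toStrings(TestClass[LocalDate](Some(aDate),
    //   Some(SubClass[LocalDate](Some(aDate))), Some(Seq(Some(aDate)))))).toDS
    // ds.show

    The trade-off is that the columns are typed as strings rather than dates, so any date logic has to happen either before encoding or after converting back.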