scala, apache-spark, null, case-class

How to represent nulls in Datasets consisting of case classes


I have a case class

final case class FieldStateData(
    job_id: String = null,
    job_base_step_id: String = null,
    field_id: String = null,
    data_id: String = null,
    data_value: String = null,
    executed_unit: String = null,
    is_doc: Boolean = null,
    mime_type: String = null,
    filename: String = null,
    filesize: BigInt = null,
    caption: String = null,
    executor_id: String = null,
    executor_name: String = null,
    executor_email: String = null,
    created_at: BigInt = null
)

that I want to use as part of a Dataset[FieldStateData], to eventually insert into a database. All columns need to be nullable. How would I represent null for value types (those descended from AnyVal, such as Boolean) rather than for strings? I thought about using Option[Boolean] or something like that, but will that automatically unbox during insertion, or when it's used in a SQL query?

Also note that the code above is not correct: Boolean is not nullable in Scala. It's just an example.


Solution

  • You are correct to use the Option monad in the case class. Strings are AnyRef and can hold null, but value types like Boolean (AnyVal) cannot; wrapping every field in Option covers both cases uniformly, and Spark's encoders unbox the fields on read.

    import org.apache.spark.sql.{Dataset, Encoder, Encoders}

    final case class FieldStateData(job_id: Option[String],
                                    job_base_step_id: Option[String],
                                    field_id: Option[String],
                                    data_id: Option[String],
                                    data_value: Option[String],
                                    executed_unit: Option[String],
                                    is_doc: Option[Boolean],
                                    mime_type: Option[String],
                                    filename: Option[String],
                                    filesize: Option[BigInt],
                                    caption: Option[String],
                                    executor_id: Option[String],
                                    executor_name: Option[String],
                                    executor_email: Option[String],
                                    created_at: Option[BigInt])

    implicit val fieldCodec: Encoder[FieldStateData] = Encoders.product[FieldStateData]

    // Assumes an active SparkSession named `spark`; parquet is only an
    // example reader, swap in whatever matches your actual source.
    val ds: Dataset[FieldStateData] = spark.read.parquet("source_path").as[FieldStateData]

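    To convince yourself that the encoder treats None as SQL NULL, you can build a small Dataset by hand and inspect it. A minimal sketch, assuming the same running `spark` session (the sample values are made up):

    import spark.implicits._

    val sample = Seq(
      FieldStateData(Some("job-1"), None, None, None, None, None,
                     Some(true), None, None, None, None, None, None, None, None)
    ).toDS()

    sample.printSchema()   // every column is reported as nullable = true
    // Selecting the columns should show job-1, true, and null for filesize:
    sample.select("job_id", "is_doc", "filesize").show()
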

    When you write the Dataset back to the database, each None becomes a null value and each Some(x) is written as the value x.
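
    For the insert itself, a plain JDBC write is one option. A minimal sketch; the URL, table name, and credentials are placeholders you would replace with your own:

    import java.util.Properties
    import org.apache.spark.sql.SaveMode

    val props = new Properties()
    props.setProperty("user", "db_user")          // placeholder credentials
    props.setProperty("password", "db_password")

    // None fields land as SQL NULLs in the (nullable) target columns.
    ds.write
      .mode(SaveMode.Append)
      .jdbc("jdbc:postgresql://host:5432/mydb", "field_state_data", props)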