I'm trying to write a unit test that matches my data output, but struggling to create a sample dataframe of the right format.
The schema needs to look like this:
|-- ids: string (nullable = true)
|-- scores: map (nullable = true)
| |-- key: string
| |-- value: struct (valueContainsNull = false)
| | |-- myscore1: double (nullable = true)
| | |-- myscore2: double (nullable = true)
and the output for one row should look like this:
+-----+-----------------------------------------+
|ids  |scores                                   |
+-----+-----------------------------------------+
|id_1 |{key1 -> {0.7, 1.3}, key2 -> {0.5, 1.2}} |
+-----+-----------------------------------------+
My best attempt so far is below, but it still gives null for the scores column. What am I missing?
val exDf = List[(String, Option[String])](
  ("id_1", Some("{\"key1\":Row(0.7, 1.3), \"key2\":Row(0.5, 1.2)}"))
).toDF("ids", "scores")
  .withColumn("scores", from_json(col("scores"),
    MapType(StringType, StructType(Array(
      StructField("myscore1", DoubleType),
      StructField("myscore2", DoubleType))))))
I've tried a number of variations on the syntax of exDf and on the schema definition, but I always get null output for the scores column. I'm running Scala on Spark 3.3.1.
It's easier to let Scala infer the dataframe column types. For the struct-typed scores values, just create a case class with Option[Double] fields to make them nullable.
case class ScoreVal(myscore1: Option[Double], myscore2: Option[Double])

import spark.implicits._  // needed for toDF (already in scope in spark-shell)

val exDf = Seq(
  ("id_1", Map("key1" -> ScoreVal(Some(0.7), Some(1.3)), "key2" -> ScoreVal(Some(0.5), Some(1.2)))),
  ("id_2", Map("key3" -> ScoreVal(Some(2.0), None)))
).toDF("ids", "scores")
exDf.printSchema
root
|-- ids: string (nullable = true)
|-- scores: map (nullable = true)
| |-- key: string
| |-- value: struct (valueContainsNull = true)
| | |-- myscore1: double (nullable = true)
| | |-- myscore2: double (nullable = true)
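As an aside, the from_json approach in the question fails because the string isn't valid JSON: Row(0.7, 1.3) is Scala syntax, not JSON, and from_json returns null for any record it can't parse against the given schema. A minimal sketch of that route with a valid JSON payload (assuming an active SparkSession with its implicits in scope):

```scala
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types._

// Map from string keys to a struct of two doubles, matching the target schema.
val scoreSchema = MapType(
  StringType,
  StructType(Array(
    StructField("myscore1", DoubleType),
    StructField("myscore2", DoubleType)
  ))
)

// The struct fields must appear as named JSON object keys, not positional values.
val jsonDf = Seq(
  ("id_1", """{"key1": {"myscore1": 0.7, "myscore2": 1.3}, "key2": {"myscore1": 0.5, "myscore2": 1.2}}""")
).toDF("ids", "scores")
  .withColumn("scores", from_json(col("scores"), scoreSchema))
```

The case-class route above is still simpler for test fixtures, since the encoder derives the schema for you; from_json is mainly useful when the data genuinely arrives as JSON strings.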