
Convert string column to JSON map of structs in Scala


I'm trying to write a unit test that matches my data output, but I'm struggling to create a sample DataFrame in the right format.

The schema needs to look like this:

    |-- ids: string (nullable = true)
    |-- scores: map (nullable = true)
    |    |-- key: string
    |    |-- value: struct (valueContainsNull = false)
    |    |    |-- myscore1: double (nullable = true)
    |    |    |-- myscore2: double (nullable = true)

and the output for one row should look like this, for example:

    +-----+-----------------------------------------+
    |ids  |scores                                   |
    +-----+-----------------------------------------+
    |id_1 |{key1 -> {0.7, 1.3}, key2 -> {0.5, 1.2}} |
    +-----+-----------------------------------------+

My best attempt so far is below, but it still gives null for the scores column. What am I missing?

    val exDf = List[(String, Option[String])](("id_1", Some("{\"key1\":Row(0.7, 1.3), \"key2\":Row(0.5, 1.2)}")))
      .toDF("ids", "scores")
      .withColumn("scores", from_json(col("scores"),
        MapType(StringType, StructType(Array(StructField("myscore1", DoubleType), StructField("myscore2", DoubleType))))))

I've tried a number of variations on the syntax of exDf and on the schema definition, but I always get null for the scores column. I'm running Scala on Spark 3.3.1.


Solution

  • It's easier to let Spark infer the DataFrame column types from the Scala types. For the struct values in scores, define a case class with Option[Double] fields so that the fields are nullable.

    case class ScoreVal(myscore1: Option[Double], myscore2: Option[Double])
    
    val exDf = Seq(
      ("id_1", Map("key1" -> ScoreVal(Some(0.7), Some(1.3)), "key2" -> ScoreVal(Some(0.5), Some(1.2)))),
      ("id_2", Map("key3" -> ScoreVal(Some(2.0), None)))
    ).toDF("ids", "scores")
    
    exDf.printSchema
    root
     |-- ids: string (nullable = true)
     |-- scores: map (nullable = true)
     |    |-- key: string
     |    |-- value: struct (valueContainsNull = true)
     |    |    |-- myscore1: double (nullable = true)
     |    |    |-- myscore2: double (nullable = true)
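
If you would rather keep the from_json route, the likely cause of the null is that the input string is not valid JSON: `Row(0.7, 1.3)` is Scala syntax, and from_json returns null for a string it cannot parse. A minimal sketch, assuming a local SparkSession and writing each struct as a JSON object keyed by field name (the names `scoreSchema`, `json` and `jsonDf` are only illustrative):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types._

// Assumption: a throwaway local session, as you might use in a unit test.
val spark = SparkSession.builder()
  .master("local[1]")
  .appName("from-json-sketch")
  .getOrCreate()
import spark.implicits._

// Same target type as in the question: map of string -> struct of two doubles.
val scoreSchema = MapType(
  StringType,
  StructType(Seq(
    StructField("myscore1", DoubleType),
    StructField("myscore2", DoubleType)
  ))
)

// Each map value is a JSON object whose keys match the struct field names.
val json =
  """{"key1":{"myscore1":0.7,"myscore2":1.3},"key2":{"myscore1":0.5,"myscore2":1.2}}"""

val jsonDf = Seq(("id_1", json)).toDF("ids", "scores")
  .withColumn("scores", from_json(col("scores"), scoreSchema))

jsonDf.printSchema()
jsonDf.show(truncate = false)
```

For a test fixture the case class version above is usually less fragile, since there is no JSON string to get wrong, but the sketch shows what the string would need to look like for from_json to parse it.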