apache-spark apache-spark-sql databricks azure-databricks

How to parse a JSON containing string property representing JSON

I have many JSONs with structure as followed.

{
    "p1":"v1",
    "p2":"v2",
    "p3":"v3",
    "modules": "{ \"nest11\":\"n1v1\", \"nest12\":\"n1v2\", \"nest13\": { \"nest21\": \"n2v1\" }  }"
}

How to parse it to this?

v1, v2, v3, n1v1, n1v2, n2v1

It is not a problem to extract "v1, v2, v3", but how to access "n1v1, n1v2, n2v1" With Spark Data Frame API

Solution

One approach is to use the DataFrameFlattener implicit class found in the official databricks site.

First you will need to define the JSON schema for the modules column then you flatten the dataframe as shown below. Here I assume that the file test_json.txt will have the next content:

{
    "p1":"v1",
    "p2":"v2",
    "p3":"v3",
    "modules": "{ \"nest11\":\"n1v1\", \"nest12\":\"n1v2\", \"nest13\": { \"nest21\": \"n2v1\" }  }"
}

Here is the code:

import org.apache.spark.sql.functions.col
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.types.{DataType, StructType, StringType}

implicit class DataFrameFlattener(df: DataFrame) {
  def flattenSchema: DataFrame = {
    df.select(flatten(Nil, df.schema): _*)
  }

  protected def flatten(path: Seq[String], schema: DataType): Seq[Column] = schema match {
    case s: StructType => s.fields.flatMap(f => flatten(path :+ f.name, f.dataType))
    case other => col(path.map(n => s"`$n`").mkString(".")).as(path.mkString(".")) :: Nil
  }
}

val schema = (new StructType)
  .add("nest11", StringType)
  .add("nest12", StringType)
  .add("nest13", (new StructType).add("nest21", StringType, false))

val df = spark.read
  .option("multiLine", true).option("mode", "PERMISSIVE")
  .json("C:\\temp\\test_json.txt")

df.withColumn("modules", from_json($"modules", schema))
  .select($"*")
  .flattenSchema

And this should be the output:

+--------------+--------------+---------------------+---+---+---+
|modules.nest11|modules.nest12|modules.nest13.nest21|p1 |p2 |p3 |
+--------------+--------------+---------------------+---+---+---+
|n1v1          |n1v2          |n2v1                 |v1 |v2 |v3 |
+--------------+--------------+---------------------+---+---+---+

Please let me know if you need further clarification.