Search code examples
apache-sparkapache-spark-sqldatabricksazure-databricks

How to parse a JSON containing string property representing JSON


I have many JSONs with structure as followed.

{
    "p1":"v1",
    "p2":"v2",
    "p3":"v3",
    "modules": "{ \"nest11\":\"n1v1\", \"nest12\":\"n1v2\", \"nest13\": { \"nest21\": \"n2v1\" }  }"
}

How to parse it to this?

v1, v2, v3, n1v1, n1v2, n2v1

It is not a problem to extract "v1, v2, v3", but how to access "n1v1, n1v2, n2v1" With Spark Data Frame API


Solution

  • One approach is to use the DataFrameFlattener implicit class found in the official databricks site.

    First you will need to define the JSON schema for the modules column then you flatten the dataframe as shown below. Here I assume that the file test_json.txt will have the next content:

    {
        "p1":"v1",
        "p2":"v2",
        "p3":"v3",
        "modules": "{ \"nest11\":\"n1v1\", \"nest12\":\"n1v2\", \"nest13\": { \"nest21\": \"n2v1\" }  }"
    }
    

    Here is the code:

    import org.apache.spark.sql.functions.col
    import org.apache.spark.sql.{Column, DataFrame}
    import org.apache.spark.sql.types.{DataType, StructType, StringType}
    
    implicit class DataFrameFlattener(df: DataFrame) {
      def flattenSchema: DataFrame = {
        df.select(flatten(Nil, df.schema): _*)
      }
    
      protected def flatten(path: Seq[String], schema: DataType): Seq[Column] = schema match {
        case s: StructType => s.fields.flatMap(f => flatten(path :+ f.name, f.dataType))
        case other => col(path.map(n => s"`$n`").mkString(".")).as(path.mkString(".")) :: Nil
      }
    }
    
    val schema = (new StructType)
      .add("nest11", StringType)
      .add("nest12", StringType)
      .add("nest13", (new StructType).add("nest21", StringType, false))
    
    val df = spark.read
      .option("multiLine", true).option("mode", "PERMISSIVE")
      .json("C:\\temp\\test_json.txt")
    
    df.withColumn("modules", from_json($"modules", schema))
      .select($"*")
      .flattenSchema
    

    And this should be the output:

    +--------------+--------------+---------------------+---+---+---+
    |modules.nest11|modules.nest12|modules.nest13.nest21|p1 |p2 |p3 |
    +--------------+--------------+---------------------+---+---+---+
    |n1v1          |n1v2          |n2v1                 |v1 |v2 |v3 |
    +--------------+--------------+---------------------+---+---+---+
    

    Please let me know if you need further clarification.