Tags: python, pyspark, apache-spark-sql, azure-databricks

Converting Table columns and values to nested JSON


I have a CSV file with columns as below

[Image: screenshot of the CSV columns and sample values]

The required result, as nested JSON objects, is shown below:

{
  "Type": "A",
  "Value": "1"
}

{
  "Type": "B",
  "Value": "1"
}

I tried the code below; any help would be deeply appreciated.

    from pyspark.sql.functions import *

    list1 = ['A', 'B', 'C']
    df2 = spark.sql("select * from test limit 10")
    df3 = df2.select(list1)
    for i in range(0, len(list1)):
        df4 = df3.withColumn("Type", lit(list1[i]))

Solution

  • This can be done with the trick of unpivoting the columns.

    Suppose you have a dataset like the one below. Let's call it a patient's test results: columns A, B, C, … stand for Test Type A, Test Type B, …, and the values in those columns are the numeric test results.

    +-------------+---+---+---+---+---+---+---+
    |PatientNumber|  A|  B|  C|  D|  E|  F|  G|
    +-------------+---+---+---+---+---+---+---+
    |          101|  1|  2|  3|  4|  5|  6|  7|
    |          102| 11| 12| 13| 14| 15| 16| 17|
    +-------------+---+---+---+---+---+---+---+
    

    I've added a PatientNumber column just to make the data look more sensible. You can remove it from your code.

    I saved this dataset to a CSV file and read it back:

    val testDF = spark.read.format("csv").option("header", "true").load("""C:\TestData\CSVtoJSon.csv""")
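    
    Since the question is tagged pyspark, here is a minimal PySpark sketch of the same read. The path is just the example location (adjust it to your environment), and a SparkSession named spark is assumed to exist, as in a Databricks notebook:
    
    # hypothetical path; read the CSV with a header row
    testDF = spark.read.format("csv").option("header", "true").load("/TestData/CSVtoJSon.csv")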
    

    Let's create two arrays, one for the id columns and another for all the test type columns:

    val idCols = Array("PatientNumber")
    
    val valCols = testDF.columns.diff(idCols)
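    
    The PySpark equivalent is a plain list comprehension (a sketch, reusing the testDF from the PySpark read above):
    
    # columns that identify a row vs. columns that hold the test values
    id_cols = ["PatientNumber"]
    val_cols = [c for c in testDF.columns if c not in id_cols]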
    

    Then here's the code to unpivot:

    // each value column contributes a pair: 'colName' (the literal Type label) and colName (the column itself)
    val valcolNames = valCols.map(x => List("'" + x + "'", x))
    val unPivotedDF = testDF.select($"PatientNumber", expr(s"""stack(${valCols.size},${valcolNames.flatMap(x => x).mkString(",")}) as (Type,Value)"""))
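    
    The same stack() expression can be built in PySpark with expr(); this sketch assumes the val_cols list from the PySpark snippet above:
    
    from pyspark.sql.functions import expr
    
    # stack(n, 'A', A, 'B', B, ...) emits one (Type, Value) row per value column
    stack_expr = ", ".join(f"'{c}', `{c}`" for c in val_cols)
    unPivotedDF = testDF.select(
        "PatientNumber",
        expr(f"stack({len(val_cols)}, {stack_expr}) as (Type, Value)")
    )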
    

    Here's what the unpivoted data looks like:

    +-------------+----+-----+
    |PatientNumber|Type|Value|
    +-------------+----+-----+
    |          101|   A|    1|
    |          101|   B|    2|
    |          101|   C|    3|
    |          101|   D|    4|
    |          101|   E|    5|
    |          101|   F|    6|
    |          101|   G|    7|
    |          102|   A|   11|
    |          102|   B|   12|
    |          102|   C|   13|
    |          102|   D|   14|
    |          102|   E|   15|
    |          102|   F|   16|
    |          102|   G|   17|
    +-------------+----+-----+
    

    Finally, write this unpivoted DataFrame as JSON:

    unPivotedDF.coalesce(1).write.format("json").mode("Overwrite").save("""C:\TestData\output""")
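    
    The PySpark version is essentially identical (the output path is illustrative):
    
    # coalesce(1) produces a single JSON file instead of one file per partition
    unPivotedDF.coalesce(1).write.format("json").mode("overwrite").save("/TestData/output")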
    

    The content of the JSON file matches your desired result:

    {"PatientNumber":"101","Type":"A","Value":"1"}
    {"PatientNumber":"101","Type":"B","Value":"2"}
    {"PatientNumber":"101","Type":"C","Value":"3"}
    {"PatientNumber":"101","Type":"D","Value":"4"}
    {"PatientNumber":"101","Type":"E","Value":"5"}
    {"PatientNumber":"101","Type":"F","Value":"6"}
    {"PatientNumber":"101","Type":"G","Value":"7"}
    {"PatientNumber":"102","Type":"A","Value":"11"}
    {"PatientNumber":"102","Type":"B","Value":"12"}
    {"PatientNumber":"102","Type":"C","Value":"13"}
    {"PatientNumber":"102","Type":"D","Value":"14"}
    {"PatientNumber":"102","Type":"E","Value":"15"}
    {"PatientNumber":"102","Type":"F","Value":"16"}
    {"PatientNumber":"102","Type":"G","Value":"17"}