python-3.x, apache-spark, pyspark

PySpark parsing nested JSON ignoring all the keys


I have a single-line JSON file that I am trying to parse and store using PySpark.

Raw file content of 'path.json':

{"number": 34, "tool": {"name": "temp", "guid": null, "version": "2.13:1"}}

Code in PySpark:

>>> df = spark.read.options(multiline=True, dropFieldIfAllNull=False).json('path.json')
>>> df.show(1,False)
+------+--------------------+
|number|tool                |
+------+--------------------+
|34    |{null, temp, 2.13:1}|
+------+--------------------+


>>> df.select('tool').show(1,False)
+--------------------+
|tool                |
+--------------------+
|{null, temp, 2.13:1}|
+--------------------+

>>>
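
The field names are not lost, though; they are still in the DataFrame's schema, and show() simply omits them when rendering a struct. Checking the inferred schema (output sketched from the sample row above):

>>> df.printSchema()
root
 |-- number: long (nullable = true)
 |-- tool: struct (nullable = true)
 |    |-- guid: string (nullable = true)
 |    |-- name: string (nullable = true)
 |    |-- version: string (nullable = true)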

As you can see, the value of the key "tool" is rendered as {null, temp, 2.13:1}, and when I store it in the database it is stored the same way. However, I want to keep it in the proper key-value format, like {"name": "temp", "guid": null, "version": "2.13:1"}.

Storing to the database:

df2 = df_with_ts.withColumn('tool', col('tool').cast('string'))

df2.write \
    .format("") \
    .option("url", url) \
    .option("dbtable", "temp") \
    .option("tempdir", "path") \
    .mode("append") \
    .save()
   

I want it to be stored like this:

{"name": "temp", "guid": null, "version": "2.13:1"}

Solution

  • AFAIU, what you need is to stringify the tool column; you can do that using the to_json or concat functions.

    The tool column is already a Struct with keys and values, but packing the Struct into a string keeps only the list of values, as the quick check below shows.
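
    You can see this by casting the struct to a string directly; the field names are dropped and only the values remain (a quick sketch, assuming the df read above):

    from pyspark.sql.functions import col
    
    # Casting the struct to string reproduces the key-less rendering from the question
    df.withColumn('toolCast', col('tool').cast('string')).select('toolCast').show(1, False)
    
    +--------------------+
    |toolCast            |
    +--------------------+
    |{null, temp, 2.13:1}|
    +--------------------+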

    A more generic option to dump it as JSON is to use to_json:

    from pyspark.sql.functions import to_json
    
    df.withColumn('toolStr', to_json('tool', options={"ignoreNullFields":False})).show(10, False)
    
    +------+--------------------+----------------------------------------------+
    |number|tool                |toolStr                                       |
    +------+--------------------+----------------------------------------------+
    |34    |{null, temp, 2.13:1}|{"guid":null,"name":"temp","version":"2.13:1"}|
    +------+--------------------+----------------------------------------------+
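
    To persist it in that form, replace the struct column with its JSON string before writing; a minimal sketch that reuses the write pipeline from the question:

    from pyspark.sql.functions import to_json
    
    # Serialize the struct (keeping null fields) so the database receives a real JSON string
    df2 = df.withColumn('tool', to_json('tool', options={"ignoreNullFields": False}))
    # ...then write df2 with the same format/url/dbtable/tempdir options as before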
    

    If you need a custom format while dumping the Struct, then you can use the concat function:

    from pyspark.sql.functions import concat, lit, when, isnull, col
    
    df = spark.read.options(multiline=True, dropFieldIfAllNull=False).json('path.json')
    
    df.withColumn('toolStr', concat(
        lit('{"name": '), when(isnull('tool.name'), lit('""')).otherwise(col('tool.name')),
        lit(', "guid": '), when(isnull('tool.guid'), lit('""')).otherwise(col('tool.guid')),
        lit(', "version": '), when(isnull('tool.version'), lit('""')).otherwise(col('tool.version')), lit('}')
    )).show(1, False)
    
    +------+--------------------+---------------------------------------------+
    |number|tool                |toolStr                                      |
    +------+--------------------+---------------------------------------------+
    |34    |{null, temp, 2.13:1}|{"name": temp, "guid": "", "version": 2.13:1}|
    +------+--------------------+---------------------------------------------+
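
    Note that in this custom format the string values (temp, 2.13:1) are unquoted, so the result is not valid JSON. If you want exactly the output from the question, a variant (a sketch, using a hypothetical quoted helper) wraps each non-null value in quotes and emits an unquoted JSON null otherwise:

    from pyspark.sql.functions import coalesce, concat, lit, col
    
    def quoted(c):
        # concat returns null if any input is null, so coalesce falls back to a literal JSON null
        return coalesce(concat(lit('"'), col(c), lit('"')), lit('null'))
    
    df.withColumn('toolStr', concat(
        lit('{"name": '), quoted('tool.name'),
        lit(', "guid": '), quoted('tool.guid'),
        lit(', "version": '), quoted('tool.version'), lit('}')
    )).show(1, False)
    
    which should print something like
    
    +------+--------------------+---------------------------------------------------+
    |number|tool                |toolStr                                            |
    +------+--------------------+---------------------------------------------------+
    |34    |{null, temp, 2.13:1}|{"name": "temp", "guid": null, "version": "2.13:1"}|
    +------+--------------------+---------------------------------------------------+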