I have the single-line JSON. trying to parse and store using Pyspark
Raw file content of 'path.json'
{"number": 34, "tool": {"name": "temp", "guid": null, "version": "2.13:1"}}
code in pyspark
>>> df = spark.read.options(multiline=True, dropFieldIfAllNull=False).json('path.json')
>>> df.show(1,False)
+------+--------------------+
|number|tool |
+------+--------------------+
|34 |{null, temp, 2.13.1}|
+------+--------------------+
>>> df.select('tool').show(1,False)
+--------------------+
|tool |
+--------------------+
|{null, temp, 2.13:1}|
+--------------------+
>>>
if you see the structure of the key "tool" is {null, temp, 2.13:1}
and when I am storing this to the database it is stored in the same way. However, I want to keep it in the proper key-value format like {"name": "temp", "guid": null, "version": "2.13;1"}
.
Storing to database
df2 = df_with_ts.withColumn('tool', col('tool').cast('string'))
df2.write \
.format("") \
.option("url", url) \
.option("dbtable", "temp") \
.option("tempdir", "path") \
.mode("append") \
.save()
I want this to store like
{"name": "temp", "guid": null, "version": "2.13;1"}
AFAIU, what you need is to stringify the tool
column, you can do that using the to_json
or concat
functions.
The tool
column is already a Struct
with key and value but packing the Struct into string takes only the list of values.
A more generic option to dump it as a json is to use to_json
from pyspark.sql.functions import to_json
df.withColumn('toolStr', to_json('tool', options={"ignoreNullFields":False})).show(10, False)
+------+--------------------+----------------------------------------------+
|number|tool |toolStr |
+------+--------------------+----------------------------------------------+
|34 |{null, temp, 2.13:1}|{"guid":null,"name":"temp","version":"2.13:1"}|
+------+--------------------+----------------------------------------------+
If you need a custom format while dumping the Struct, then you can use concat
function
from pyspark.sql.functions import concat, lit, when, isnull, col
df = spark.read.options(multiline=True, dropFieldIfAllNull=False).json('path.json')
df.withColumn('toolStr', concat(\
lit('{"name": '), when(isnull('tool.name'), lit('""')).otherwise(col('tool.name')),\
lit(', "guid": '), when(isnull('tool.guid'), lit('""')).otherwise(col('tool.guid')),\
lit(', "version": '), when(isnull('tool.version'), lit('""')).otherwise(col('tool.version')), lit('}')\
)).show(1,False)
+------+--------------------+---------------------------------------------+
|number|tool |toolStr |
+------+--------------------+---------------------------------------------+
|34 |{null, temp, 2.13:1}|{"name": temp, "guid": "", "version": 2.13:1}|
+------+--------------------+---------------------------------------------+