I have a field in a DataFrame that contains JSON. Is there a way to exclude some columns from the JSON?
Input:
{"column1":"data", "column2":"data"}
Expected output:
{"column1":"data"}
You can convert the JSON string into a MapType with the from_json function, filter the map with map_filter to drop the keys you want to exclude, then convert it back to JSON with to_json:
import pyspark.sql.functions as F

df = spark.createDataFrame([('{"column1":"data", "column2":"data"}',)], ["json_col"])

cols_to_exclude = ["column2"]

df1 = df.withColumn(
    "json_col",
    # parse the JSON string into a map<string,string> column
    F.from_json("json_col", "map<string,string>")
).withColumn(
    "json_col",
    # keep only the entries whose key is not in the exclusion list,
    # then serialize the filtered map back to a JSON string
    F.to_json(
        F.map_filter("json_col", lambda k, v: ~k.isin(cols_to_exclude))
    )
)

df1.show()
#+------------------+
#| json_col|
#+------------------+
#|{"column1":"data"}|
#+------------------+
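Note that F.map_filter was only added to the Python API in Spark 3.1; the underlying SQL higher-order function exists from Spark 3.0. On Spark 3.0 you can reach it through F.expr instead. A minimal sketch under that assumption, hard-coding the excluded key since the SQL lambda can't reference the Python list directly:

import pyspark.sql.functions as F

# Same transformation via the SQL higher-order function, for Spark 3.0
# where F.map_filter is not yet exposed in the Python API.
df1 = df.withColumn(
    "json_col",
    F.expr(
        "to_json(map_filter(from_json(json_col, 'map<string,string>'), "
        "(k, v) -> k != 'column2'))"
    )
)

This produces the same output as above. If you need to exclude several keys, you can build the expression string from your Python list before passing it to F.expr.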