Tags: amazon-web-services, apache-spark, pyspark, apache-spark-sql, aws-glue

Exclude column from json string?


I have a field in a dataframe that contains a JSON string. Is there a way to exclude some keys from that JSON?

Input:

{"column1":"data", "column2":"data"}

Expected output:

{"column1":"data"}

Solution

  • You can parse the JSON into a MapType column with the from_json function, filter out the keys you want to exclude with map_filter, then serialize back to a JSON string with to_json:

    import pyspark.sql.functions as F
    
    df = spark.createDataFrame([('{"column1":"data", "column2":"data"}',)], ["json_col"])
    
    cols_to_exclude = ["column2"]
    
    # Parse the JSON string into a map<string,string>, drop the unwanted keys,
    # then serialize back to JSON. Note: F.map_filter requires Spark 3.1+.
    df1 = df.withColumn(
        "json_col",
        F.from_json("json_col", "map<string,string>")
    ).withColumn(
        "json_col",
        F.to_json(
            F.map_filter("json_col", lambda k, v: ~k.isin(cols_to_exclude))
        )
    )
    
    df1.show()
    #+------------------+
    #|          json_col|
    #+------------------+
    #|{"column1":"data"}|
    #+------------------+
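
  • If you are on a Spark version older than 3.1 (where F.map_filter is not available in PySpark), one fallback is a plain-Python UDF built on the standard json module. This is a sketch: `exclude_keys` is a hypothetical helper name, and it only handles top-level keys in flat JSON objects.

    ```python
    import json

    def exclude_keys(json_str, keys_to_exclude):
        """Parse a JSON string, drop the given top-level keys, re-serialize."""
        data = json.loads(json_str)
        return json.dumps(
            {k: v for k, v in data.items() if k not in keys_to_exclude},
            separators=(",", ":"),  # compact output, matching to_json's style
        )

    print(exclude_keys('{"column1":"data", "column2":"data"}', {"column2"}))
    # {"column1":"data"}
    ```

    To use it on a dataframe, wrap it with `F.udf(lambda s: exclude_keys(s, {"column2"}), StringType())` and apply it via withColumn. A Python UDF is slower than the map_filter approach, since each row round-trips through the Python interpreter instead of staying in Spark's optimized expressions.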