Tags: amazon-web-services, apache-spark, pyspark, apache-spark-sql, aws-glue

Exclude column from json string?


I have a field in a dataframe that contains a JSON string. Is there a way to exclude some keys from that JSON?

Input:

{"column1":"data", "column2":"data"}

Expected output:

{"column1":"data"}

Solution

  • You can parse the JSON into a MapType column with the from_json function, filter out the keys you want to exclude with map_filter, then serialize back to a JSON string with to_json:

    import pyspark.sql.functions as F
    
    df = spark.createDataFrame([('{"column1":"data", "column2":"data"}',)], ["json_col"])
    
    cols_to_exclude = ["column2"]
    
    # Parse the JSON string into a map<string,string>, drop the unwanted keys,
    # then serialize back to JSON. Note: F.map_filter requires Spark 3.1+.
    df1 = df.withColumn(
        "json_col",
        F.from_json("json_col", "map<string,string>")
    ).withColumn(
        "json_col",
        F.to_json(
            F.map_filter("json_col", lambda k, v: ~k.isin(cols_to_exclude))
        )
    )
    
    df1.show()
    #+------------------+
    #|          json_col|
    #+------------------+
    #|{"column1":"data"}|
    #+------------------+
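
  • If you are on a Spark version older than 3.1 (where F.map_filter is not available in PySpark), one fallback is a plain-Python UDF built on the standard json module. This is a sketch: `exclude_keys` is a hypothetical helper name, and it only handles top-level keys in flat JSON objects.

    ```python
    import json

    def exclude_keys(json_str, keys_to_exclude):
        """Parse a JSON string, drop the given top-level keys, re-serialize."""
        data = json.loads(json_str)
        return json.dumps(
            {k: v for k, v in data.items() if k not in keys_to_exclude},
            separators=(",", ":"),  # compact output, matching to_json's style
        )

    print(exclude_keys('{"column1":"data", "column2":"data"}', {"column2"}))
    # {"column1":"data"}
    ```

    To use it on a dataframe, wrap it with `F.udf(lambda s: exclude_keys(s, {"column2"}), StringType())` and apply it via withColumn. A Python UDF is slower than the map_filter approach, since each row round-trips through the Python interpreter instead of staying in Spark's optimized expressions.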