Tags: apache-spark, pyspark

How to flatten a list of dicts into one dict in PySpark


I want to take a dataframe like the one below, where the features column contains a list of dicts:

{"id": 1, "label": 1, "features": [{"key1": 1}, {"key2": "dog"}, {"key4": "jane"}]}
{"id": 3, "label": 0, "features": [{"key3": 3}, {"key1": 3}]}
{"id": 4, "label": 1, "features": [{"key1": 4}, {"key2": "bird"}]}
{"id": 2, "label": 1, "features": [{"key2": 2}, {"key3": 2}]}
{"id": 5, "label": 0, "features": [{"key2": "cat"}, {"key3": 5}]}
{"id": 6, "label": 1, "features": [{"key3": 6}, {"key1": 6}]}

and flatten the list into a single dict:

{"id": 1, "label": 1, "features": {"key1": 1, "key2": "dog", "key4": "jane"}}
{"id": 3, "label": 0, "features": {"key3": 3, "key1": 3}}
{"id": 4, "label": 1, "features": {"key1": 4, "key2": "bird"}}
{"id": 2, "label": 1, "features": {"key2": 2, "key3": 2}}
{"id": 5, "label": 0, "features": {"key2": "cat", "key3": 5}}
{"id": 6, "label": 1, "features": {"key3": 6, "key1": 6}}

In the actual data there will be hundreds of keys, so I'd like to avoid naming them explicitly in the code. I haven't been able to figure out how to do this, and a ChatGPT rabbit hole didn't get me anywhere. Any help appreciated!


Solution

  • map_from_entries is probably the function you need: it builds a map column from an array of (key, value) structs. If your features column already has that shape, it's a one-liner:

    from pyspark.sql.functions import col, map_from_entries

    df = df.withColumn("features", map_from_entries(col("features")))
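
With the sample JSON above, though, schema inference gives features a different shape: an array of wide structs (struct<key1, key2, key3, key4>, one field per key seen anywhere in the data, mostly null), which map_from_entries can't consume directly. Below is a minimal sketch that rebuilds the (key, value) entries first. It assumes Spark 3.1+ (for the transform/filter lambdas) and a hypothetical data.jsonl path, and it reads the key names from the schema, so none of the hundreds of keys are hard-coded. One caveat: a Spark map needs a single value type, and the desired output mixes ints and strings, so all values are cast to string here.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()

    # Hypothetical path: one JSON object per line, as in the question.
    df = spark.read.json("data.jsonl")

    # Schema inference yields features: array<struct<key1, key2, key3, key4>>.
    # Pull the field names from the schema so no key is hard-coded.
    keys = df.schema["features"].dataType.elementType.fieldNames()

    # Turn each wide struct into (key, value) entries, drop the null ones,
    # flatten the per-element arrays, and build the map. Values are cast to
    # string because a Spark map can only hold one value type.
    entries = F.flatten(
        F.transform(
            "features",
            lambda s: F.filter(
                F.array(*[
                    F.struct(F.lit(k).alias("key"), s[k].cast("string").alias("value"))
                    for k in keys
                ]),
                lambda e: e["value"].isNotNull(),
            ),
        )
    )

    df = df.withColumn("features", F.map_from_entries(entries))
    df.show(truncate=False)

The result is a map<string, string> column, e.g. {key1 -> 1, key2 -> dog, key4 -> jane} for id 1.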