Tags: python, pyspark, databricks

How to combine an array of maps into a single map per column in PySpark


I have followed this question, but the answers there are not working for me. I don't want a UDF for this, and map_concat doesn't work for me. Is there any other way to combine maps?

For example, given:

id  value
1   Map(k1 -> v1)
2   Map(k2 -> v2)

the output should be:

id  value
1   Map(k1 -> v1, k2 -> v2)
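
For context, map_concat only merges map columns that sit in the same row; it is not an aggregate function, so on its own it cannot merge maps across rows. A minimal sketch of that behavior (the names here are just for illustration):

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as f
    spark = SparkSession.builder.appName('map_concat_demo').getOrCreate()

    # two map columns in the same row
    row_df = spark.createDataFrame(
        [({'k1': 'v1'}, {'k2': 'v2'})],
        'a map<string,string>, b map<string,string>'
    )

    # map_concat merges the map *columns* of each row: {k1 -> v1, k2 -> v2}
    row_df.select(f.map_concat('a', 'b').alias('merged')).show(truncate=False)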

Solution

  • Here is my solution; I'm assuming that we can drop id:

    from pyspark.sql import functions as f
    from pyspark.sql import SparkSession
    spark = SparkSession.builder.appName('test').getOrCreate()
    
    data = [{'id':1, 'map':{'k1': 'v1'}}, {'id':2, 'map':{'k2': 'v2'}}, {'id':3, 'map':{'k3': 'v3'}}]
    df = spark.createDataFrame(data)
    
    # drop id and add a constant grouping column so all rows fall into one group
    d_df = df.drop('id').withColumn('group_id', f.lit(1))
    
    # aggregate all the maps into a single array of maps
    g_df = d_df.groupBy('group_id')\
        .agg(f.collect_list('map').alias('maps'))
    
    # fold the array of maps into one map, starting from an empty map
    final_df = g_df.select(
        f.aggregate(
            'maps',
            f.create_map().cast('map<string,string>'),  # typed empty map as initial value
            lambda acc, m: f.map_concat(acc, m)
        ).alias('map_of_maps')
    )
    final_df.show()
    

    Result:

    +--------------------+
    |         map_of_maps|
    +--------------------+
    |{k1 -> v1, k2 -> ...|
    +--------------------+
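
    Note that map_concat raises an error when the merged maps share a key, since spark.sql.mapKeyDedupPolicy defaults to EXCEPTION. If later maps should overwrite earlier ones instead, switch the policy before running the aggregation:

    # let the last-seen value win on duplicate keys instead of raising an error
    spark.conf.set('spark.sql.mapKeyDedupPolicy', 'LAST_WIN')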
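
    Also note that f.aggregate was only added in PySpark 3.1. On Spark 2.4 through 3.0 the same fold can be written with f.expr, since the SQL higher-order function aggregate has existed since 2.4; here is a sketch of the equivalent expression:

    # the same fold expressed in SQL, for Spark versions before 3.1
    final_df = g_df.select(
        f.expr(
            "aggregate(maps, cast(map() as map<string,string>), "
            "(acc, m) -> map_concat(acc, m))"
        ).alias('map_of_maps')
    )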