Tags: python, pandas, dataframe, pyspark

Convert a PySpark DataFrame to a dictionary, grouping the values of a column by key


I have the following PySpark DataFrame:

ID Value
1 value-1
1 value-2
1 value-3
2 value-1
2 value-2

I want to convert it into a dictionary:

dict1 = {'1':['value-1','value-2','value-3'], '2':['value-1','value-2']}

I was able to do it (I wrote an answer below), but I need a simpler and more efficient way that does not convert the DataFrame to Pandas.
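
For reference, a minimal sketch that reproduces this DataFrame (assuming an existing SparkSession bound to the name spark; the IDs are created as strings so the dictionary keys match the expected output):

    df_spark = spark.createDataFrame(
        [("1", "value-1"), ("1", "value-2"), ("1", "value-3"),
         ("2", "value-1"), ("2", "value-2")],
        ["ID", "Value"],  # column names match the table above
    )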


Solution

  • Native Spark approach, using rdd.collectAsMap (an RDD-free variant is sketched after this list):

    from pyspark.sql.functions import collect_list

    # Each resulting Row is an (ID, list of values) pair, which collectAsMap turns into a dict
    df_spark.groupBy("ID").agg(collect_list("Value")).rdd.collectAsMap()
    

  • An approach using Pandas' groupby and to_dict:

    # Convert to Pandas data frame
    df_pandas = df_spark.toPandas()
    
    df_pandas.groupby("ID")["Value"].apply(list).to_dict()
    

    Output:

    {'1': ['value-1', 'value-2', 'value-3'], '2': ['value-1', 'value-2']}
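
If you prefer to avoid the RDD API entirely (it may not be available in some environments, e.g. Spark Connect), a minimal sketch is to collect() the grouped rows and build the dictionary on the driver; the column alias vals is just an illustrative name:

    from pyspark.sql.functions import collect_list

    grouped = df_spark.groupBy("ID").agg(collect_list("Value").alias("vals"))
    # collect() brings one Row per ID to the driver; build the dict from those rows
    dict1 = {row["ID"]: row["vals"] for row in grouped.collect()}

Like collectAsMap, this pulls the grouped result to the driver, so it is only suitable when the number of distinct IDs fits comfortably in driver memory.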