Convert a PySpark data frame to a dictionary, grouping the values of one column by a key column

I have the following PySpark data frame:

ID Value
1 value-1
1 value-2
1 value-3
2 value-1
2 value-2

I want to convert it into a dictionary:

dict1 = {'1':['value-1','value-2','value-3'], '2':['value-1','value-2']}

I was able to do it (I wrote an answer below), but I need a much simpler and more efficient way that does not convert the data frame to Pandas.


  • Native Spark approach, using rdd.collectAsMap:

    from pyspark.sql.functions import collect_list

  • An approach using Pandas' groupby and to_dict:

    # Convert to Pandas data frame
    df_pandas = df_spark.toPandas()
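The grouping step is missing from the snippet, so the following is a hedged sketch of what the groupby and to_dict step could look like. To keep it runnable without a Spark session, a hand-built pandas DataFrame stands in for the result of df_spark.toPandas(); that stand-in and the string-typed ID column are assumptions.

```python
import pandas as pd

# Assumption: stand-in for df_spark.toPandas(), with the same columns and
# rows as the question and ID stored as strings.
df_pandas = pd.DataFrame(
    {
        "ID": ["1", "1", "1", "2", "2"],
        "Value": ["value-1", "value-2", "value-3", "value-1", "value-2"],
    }
)

# Group by ID, turn each group's Value column into a list, and convert the
# resulting Series (index = ID, values = lists) into a plain dict.
dict1 = df_pandas.groupby("ID")["Value"].apply(list).to_dict()

print(dict1)
```

If ID is an integer column in the real data, the dictionary keys will be integers rather than strings unless the column is cast first.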

    {'1': ['value-1', 'value-2', 'value-3'], '2': ['value-1', 'value-2']}