I have the following PySpark DataFrame:
| ID | Value   |
|----|---------|
| 1  | value-1 |
| 1  | value-2 |
| 1  | value-3 |
| 2  | value-1 |
| 2  | value-2 |
I want to convert it into a dictionary:
dict1 = {'1':['value-1','value-2','value-3'], '2':['value-1','value-2']}
I was able to do it (I wrote an answer below), but I need a much simpler and more efficient way that avoids converting the DataFrame to Pandas.
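For reference, a minimal sketch of how this example frame can be built, assuming an existing `SparkSession` named `spark` (the string-typed ID column here is an assumption, chosen to match the string keys in the expected dict):

# Assumption: `spark` is an existing SparkSession
df_spark = spark.createDataFrame(
    [("1", "value-1"), ("1", "value-2"), ("1", "value-3"),
     ("2", "value-1"), ("2", "value-2")],
    ["ID", "Value"],
)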
A native Spark approach, using `rdd.collectAsMap`:
from pyspark.sql.functions import collect_list

# Aggregate the values for each ID into a list, then collect the
# resulting (ID, list) rows into a dict on the driver
df_spark.groupBy("ID").agg(collect_list("Value")).rdd.collectAsMap()
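If you prefer to stay in the DataFrame API rather than dropping to the RDD, an equivalent sketch (the alias `Values` is just a name chosen here for illustration):

from pyspark.sql.functions import collect_list

# Collect the aggregated rows to the driver and build the dict directly
rows = df_spark.groupBy("ID").agg(collect_list("Value").alias("Values")).collect()
dict1 = {row["ID"]: row["Values"] for row in rows}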
An approach using Pandas' `groupby` and `to_dict`:
# Convert the Spark DataFrame to a Pandas DataFrame
# (this collects all rows to the driver)
df_pandas = df_spark.toPandas()

# Group by ID, turn each group's values into a list, then build the dict
df_pandas.groupby("ID")["Value"].apply(list).to_dict()
{'1': ['value-1', 'value-2', 'value-3'], '2': ['value-1', 'value-2']}
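Note that both approaches collect the full result to the driver, so they are only appropriate when the per-ID dictionary fits in driver memory; the native Spark version has the advantage of aggregating on the executors first instead of materializing every row in Pandas.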