Tags: python, pyspark, group-by, apache-spark-sql, aggregate

PySpark DataFrame groupby into list of values?


Simply put, let's say I have the following DataFrame:

+-------------+----------+------+
|employee_name|department|salary|
+-------------+----------+------+
|        James|     Sales|  3000|
|      Michael|     Sales|  4600|
|       Robert|     Sales|  4100|
|        Maria|   Finance|  3000|
|        James|     Sales|  3000|
|        Scott|   Finance|  3300|
|          Jen|   Finance|  3900|
|         Jeff| Marketing|  3000|
|        Kumar| Marketing|  2000|
|         Saif|     Sales|  4100|
+-------------+----------+------+

How can I group by department and collect the values of the other columns into lists, as follows:

+----------+-------------------------------------+------------------------------+
|department|employee_name                        |salary                        |
+----------+-------------------------------------+------------------------------+
|Sales     |[James, Michael, Robert, James, Saif]|[3000, 4600, 4100, 3000, 4100]|
|Finance   |[Maria, Scott, Jen]                  |[3000, 3300, 3900]            |
|Marketing |[Jeff, Kumar]                        |[3000, 2000]                  |
+----------+-------------------------------------+------------------------------+

Solution

  • Use collect_list with a groupBy clause. Note that the second aggregation must collect the salary column (the original snippet collected employee_name twice):

    from pyspark.sql.functions import col, collect_list

    df.groupBy(col("department")).agg(
        collect_list(col("employee_name")).alias("employee_name"),
        collect_list(col("salary")).alias("salary"),
    )
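
For completeness, here is a minimal self-contained sketch that builds the sample DataFrame and applies the aggregation end to end. The local SparkSession setup and variable names are assumptions for illustration, not part of the original question:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, collect_list

    # Local session for illustration only (an assumption);
    # in practice, use your existing SparkSession.
    spark = SparkSession.builder.master("local[*]").appName("groupby-to-list").getOrCreate()

    data = [
        ("James", "Sales", 3000), ("Michael", "Sales", 4600),
        ("Robert", "Sales", 4100), ("Maria", "Finance", 3000),
        ("James", "Sales", 3000), ("Scott", "Finance", 3300),
        ("Jen", "Finance", 3900), ("Jeff", "Marketing", 3000),
        ("Kumar", "Marketing", 2000), ("Saif", "Sales", 4100),
    ]
    df = spark.createDataFrame(data, ["employee_name", "department", "salary"])

    # Group by department and collect the remaining columns into lists.
    result = df.groupBy(col("department")).agg(
        collect_list(col("employee_name")).alias("employee_name"),
        collect_list(col("salary")).alias("salary"),
    )
    result.show(truncate=False)

One design note: collect_list keeps duplicates (James and his 3000 salary appear twice under Sales, matching the expected output above); if you want only distinct values, use collect_set instead. Also be aware that the order of elements within the collected lists is not guaranteed.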