
PySpark: Groupby within groups and display sum in separate fields based on certain values


I have a DataFrame with the columns Company, EmpId, Bonus, and Salary:

COMPANY  EMPID  BONUS  SALARY
APPLE    1234   No     5
APPLE    1235   No     7
GOOGLE   6786   Yes    6
GOOGLE   6787   No     5
GOOGLE   6788   No     6
TARGET   9091   Yes    4
TARGET   9092   Yes    1
TARGET   9093   Yes    9
TARGET   9094   No     2
TARGET   9094   Yes    4

I want to group by company name (ignoring EmpId) and create two columns: the total salary of employees who did not receive a bonus, and the total salary of employees who did.

Expected output:

COMPANY  TOT_AMT_NO_BONUS  TOTAL_BONUS
APPLE    12                0
GOOGLE   11                6
TARGET   2                 18

I tried groupBy, agg, and sum but am not getting the expected results. I'd appreciate any help. Thanks in advance.
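
For reproducibility, here is a minimal sketch that rebuilds the sample data as a PySpark DataFrame; the variable name sparkDF is an assumption chosen to match the solution below.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Recreate the sample data from the question.
    sparkDF = spark.createDataFrame(
        [
            ('APPLE', 1234, 'No', 5),
            ('APPLE', 1235, 'No', 7),
            ('GOOGLE', 6786, 'Yes', 6),
            ('GOOGLE', 6787, 'No', 5),
            ('GOOGLE', 6788, 'No', 6),
            ('TARGET', 9091, 'Yes', 4),
            ('TARGET', 9092, 'Yes', 1),
            ('TARGET', 9093, 'Yes', 9),
            ('TARGET', 9094, 'No', 2),
            ('TARGET', 9094, 'Yes', 4),
        ],
        ['COMPANY', 'EMPID', 'BONUS', 'SALARY'],
    )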


Solution

  • You can group by COMPANY, pivot on BONUS, sum SALARY, and then fill any null entries with 0:

    result = (
        sparkDF.groupby('COMPANY')
        .pivot('BONUS')                  # one output column per distinct BONUS value ('No', 'Yes')
        .sum('SALARY')                   # sum SALARY within each COMPANY/BONUS cell
        .withColumnRenamed('No', 'TOT_AMT_NO_BONUS')
        .withColumnRenamed('Yes', 'TOTAL_BONUS')
        .fillna(value=0, subset=['TOT_AMT_NO_BONUS', 'TOTAL_BONUS'])  # companies with no rows for a BONUS value get 0 instead of null
    )

    result.show()
    
    +-------+----------------+-----------+
    |COMPANY|TOT_AMT_NO_BONUS|TOTAL_BONUS|
    +-------+----------------+-----------+
    |  APPLE|              12|          0|
    | TARGET|               2|         18|
    | GOOGLE|              11|          6|
    +-------+----------------+-----------+
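
  • Alternatively, the same result can be obtained without pivot by using conditional aggregation. This is a sketch of that equivalent approach, assuming the same sparkDF as above:

    from pyspark.sql import functions as F

    # Sum SALARY separately for rows with and without a bonus, per company.
    result = sparkDF.groupby('COMPANY').agg(
        F.sum(F.when(F.col('BONUS') == 'No', F.col('SALARY')).otherwise(0)).alias('TOT_AMT_NO_BONUS'),
        F.sum(F.when(F.col('BONUS') == 'Yes', F.col('SALARY')).otherwise(0)).alias('TOTAL_BONUS'),
    )

    result.show()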