I have a dataframe which contains Company name, EmpId, Bonus & Salary.
COMPANY | EMPID | BONUS | SALARY |
---|---|---|---|
APPLE | 1234 | No | 5 |
APPLE | 1235 | No | 7 |
6786 | Yes | 6 | |
6787 | No | 5 | |
6788 | No | 6 | |
TARGET | 9091 | Yes | 4 |
TARGET | 9092 | Yes | 1 |
TARGET | 9093 | Yes | 9 |
TARGET | 9094 | No | 2 |
TARGET | 9094 | Yes | 4 |
I want to group by Company name regardless of EmpIds and create 2 different columns which will display total sum of Bonus & salary received by company.
Expected output:
COMPANY | TOT_AMT_NO_BONUS | TOTAL_BONUS |
---|---|---|
APPLE | 12 | 0 |
11 | 6 | |
TARGET | 2 | 18 |
Tried with groupBy, Aggregate & Sum but not getting results as expected. Appreciate if someone can help. Thanks in advance.
You can groupby company, pivot on bonus, and sum salary, then fill any null entries with 0:
sparkDF.groupby('COMPANY').pivot('BONUS').sum('salary').withColumnRenamed(
'No', 'TOT_AMT_NO_BONUS',
).withColumnRenamed(
'Yes', 'TOTAL_BONUS'
).fillna(
value=0, subset=['TOT_AMT_NO_BONUS','TOTAL_BONUS']
)
+-------+----------------+-----------+
|COMPANY|TOT_AMT_NO_BONUS|TOTAL_BONUS|
+-------+----------------+-----------+
| APPLE| 12| 0|
| TARGET| 2| 18|
| GOOGLE| 11| 6|
+-------+----------------+-----------+