
PySpark: Groupby within groups and display sum in separate fields based on certain values


I have a DataFrame with the columns Company, EmpId, Bonus, and Salary:

COMPANY  EMPID  BONUS  SALARY
APPLE    1234   No     5
APPLE    1235   No     7
GOOGLE   6786   Yes    6
GOOGLE   6787   No     5
GOOGLE   6788   No     6
TARGET   9091   Yes    4
TARGET   9092   Yes    1
TARGET   9093   Yes    9
TARGET   9094   No     2
TARGET   9094   Yes    4

I want to group by company name (ignoring EmpId) and create two columns: the total salary of employees who did not receive a bonus, and the total salary of employees who did.

Expected output:

COMPANY  TOT_AMT_NO_BONUS  TOTAL_BONUS
APPLE    12                0
GOOGLE   11                6
TARGET   2                 18

I tried groupBy, agg, and sum but am not getting the expected results. I'd appreciate any help. Thanks in advance.
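
For reproducibility, here is a minimal sketch that rebuilds the sample data as a PySpark DataFrame; the variable name sparkDF is an assumption chosen to match the solution below.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Recreate the sample data from the question.
    sparkDF = spark.createDataFrame(
        [
            ('APPLE', 1234, 'No', 5),
            ('APPLE', 1235, 'No', 7),
            ('GOOGLE', 6786, 'Yes', 6),
            ('GOOGLE', 6787, 'No', 5),
            ('GOOGLE', 6788, 'No', 6),
            ('TARGET', 9091, 'Yes', 4),
            ('TARGET', 9092, 'Yes', 1),
            ('TARGET', 9093, 'Yes', 9),
            ('TARGET', 9094, 'No', 2),
            ('TARGET', 9094, 'Yes', 4),
        ],
        ['COMPANY', 'EMPID', 'BONUS', 'SALARY'],
    )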


Solution

  • You can group by COMPANY, pivot on BONUS, sum SALARY, and then fill any null entries with 0:

    result = (
        sparkDF.groupby('COMPANY')
        .pivot('BONUS')                  # one output column per distinct BONUS value ('No', 'Yes')
        .sum('SALARY')                   # sum SALARY within each COMPANY/BONUS cell
        .withColumnRenamed('No', 'TOT_AMT_NO_BONUS')
        .withColumnRenamed('Yes', 'TOTAL_BONUS')
        .fillna(value=0, subset=['TOT_AMT_NO_BONUS', 'TOTAL_BONUS'])  # companies with no rows for a BONUS value get 0 instead of null
    )

    result.show()
    
    +-------+----------------+-----------+
    |COMPANY|TOT_AMT_NO_BONUS|TOTAL_BONUS|
    +-------+----------------+-----------+
    |  APPLE|              12|          0|
    | TARGET|               2|         18|
    | GOOGLE|              11|          6|
    +-------+----------------+-----------+
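
  • Alternatively, the same result can be obtained without pivot by using conditional aggregation. This is a sketch of that equivalent approach, assuming the same sparkDF as above:

    from pyspark.sql import functions as F

    # Sum SALARY separately for rows with and without a bonus, per company.
    result = sparkDF.groupby('COMPANY').agg(
        F.sum(F.when(F.col('BONUS') == 'No', F.col('SALARY')).otherwise(0)).alias('TOT_AMT_NO_BONUS'),
        F.sum(F.when(F.col('BONUS') == 'Yes', F.col('SALARY')).otherwise(0)).alias('TOTAL_BONUS'),
    )

    result.show()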