Search code examples
pysparkgroup-byaggregation

Aggregation of a data frame based on condition (Pyspark)


I need help with following task:
I have obtained a groupby results whereby I have a list of multiple addresses (here just a cut out with one address) with people occupying those addresses. I need to calculate ratio of usage of the app, so that I am dividing [name] + Active Count / [name] + Passive Count and creating a new dataframe with [address][name][usage_ratio]

I have never done aggregation alike and I have no idea where to start or how to loop it to execute it. Can anyone help?

+------------+--------------------+----------------+-----+
|     address|                name|      use_of_app|count|
+------------+--------------------+----------------+-----+
|          33|                Mark|          active|   35|
|          33|                Mark|         passive|    4|
|          33|               Abbie|          active|   30|
|          33|               Abbie|         passive|    2|
|          33|                Anna|         passive|    3|
|          33|                Anna|          active|   32|
|          33|                 Tom|         passive|   38|
|          33|                 Tom|          active|   50|
|          33|             Patrick|         passive|   40|
|          33|             Patrick|          active|   57|
+------------+--------------------+----------------+-----+

Solution

  • Another option would be a pivot-step:

    from pyspark.sql import function as F
    
    (
        df.groupby("address", "name")
        .pivot("use_of_app", values=["active", "passive"])
        .agg(F.sum("count"))
        .withColumn("ratio", F.col("active") / F.col("passive"))
        .show()
    )
    # Output
    +-------+-----+------+-------+------------------+
    |address| name|active|passive|             ratio|
    +-------+-----+------+-------+------------------+
    |     33|Abbie|    30|      2|              15.0|
    |     33| Anna|    32|      3|10.666666666666666|
    |     33| Mark|    35|      4|              8.75|
    +-------+-----+------+-------+------------------+
    
    

    +++ Updated according to the suggestion by Steven: .pivot("use_of_app", values=["active", "passive"]).