I need help with following task:
I have obtained a groupby results whereby I have a list of multiple addresses (here just a cut out with one address) with people occupying those addresses. I need to calculate ratio of usage of the app, so that I am dividing [name] + Active Count / [name] + Passive Count and creating a new dataframe with [address][name][usage_ratio]
I have never done aggregation alike and I have no idea where to start or how to loop it to execute it. Can anyone help?
+------------+--------------------+----------------+-----+
| address| name| use_of_app|count|
+------------+--------------------+----------------+-----+
| 33| Mark| active| 35|
| 33| Mark| passive| 4|
| 33| Abbie| active| 30|
| 33| Abbie| passive| 2|
| 33| Anna| passive| 3|
| 33| Anna| active| 32|
| 33| Tom| passive| 38|
| 33| Tom| active| 50|
| 33| Patrick| passive| 40|
| 33| Patrick| active| 57|
+------------+--------------------+----------------+-----+
Another option would be a pivot
-step:
from pyspark.sql import function as F
(
df.groupby("address", "name")
.pivot("use_of_app", values=["active", "passive"])
.agg(F.sum("count"))
.withColumn("ratio", F.col("active") / F.col("passive"))
.show()
)
# Output
+-------+-----+------+-------+------------------+
|address| name|active|passive| ratio|
+-------+-----+------+-------+------------------+
| 33|Abbie| 30| 2| 15.0|
| 33| Anna| 32| 3|10.666666666666666|
| 33| Mark| 35| 4| 8.75|
+-------+-----+------+-------+------------------+
+++
Updated according to the suggestion by Steven: .pivot("use_of_app", values=["active", "passive"])
.