There is a very similar question to this one for pandas.
My DataFrame looks like the following:
ID  Month  Metric1  Metric2
1   01     4        3
3   01     4        2
3   01     3        1
1   02     3        2
1   02     1        5
2   02     6        3
2   02     7        2
3   03     5        6
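(For anyone who wants to reproduce this, here is a minimal sketch that builds the DataFrame above; the integer types for ID and the metrics, and the string type for Month to keep the leading zero, are my assumptions.)

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1, "01", 4, 3), (3, "01", 4, 2), (3, "01", 3, 1),
     (1, "02", 3, 2), (1, "02", 1, 5), (2, "02", 6, 3),
     (2, "02", 7, 2), (3, "03", 5, 6)],
    ["ID", "Month", "Metric1", "Metric2"],
)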
I want to first group by Month and then by ID (only for the purpose of the dropping; the result itself should not stay grouped). Then I would like to drop all rows whose (ID, Month) combination appears only once, so that the expected output is:
ID  Month  Metric1  Metric2
--  -----  -------  -------
3   01     4        2
3   01     3        1
1   02     3        2
1   02     1        5
2   02     6        3
2   02     7        2
So it is the same problem as in the question I linked above, only in PySpark and with two grouping columns.
I have been trying to translate the pandas command into PySpark, but I don't know how to replace the [] boolean indexing, which PySpark DataFrames don't seem to support:
df2 = df[df.groupBy("ID", "Month").ID.count("ID") > 1]
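For reference, the pandas idiom from the linked question is, roughly, boolean indexing on a per-group count (a sketch from memory, not a verbatim quote of the linked answer):

import pandas as pd

# pandas version: keep only rows whose (ID, Month) group contains more than one row
df2 = df[df.groupby(["ID", "Month"])["ID"].transform("count") > 1]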
Also, I tried window functions, but they always end up needing an aggregation function, and I don't want the data to be aggregated:
from pyspark.sql.window import Window as W

window = W.partitionBy("Month").orderBy("ID")
# fails: "Month" is not a window or aggregate expression
df.withColumn("ID", Month.over(window)).show()
Any help would be appreciated!
Get the counts over a window partitioned by ID and Month, then eliminate the rows having a count of 1. A window aggregate adds a per-row value without collapsing the rows, so the data itself is not aggregated away:
from pyspark.sql import functions as F, Window as W

# no orderBy needed: we want the total size of each (ID, Month) group, not a running count
w = W.partitionBy("ID", "Month")
out = df.withColumn("C", F.count("ID").over(w)).filter("C != 1").drop("C")
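If the window version feels opaque, an equivalent approach (a sketch of my own, not part of the answer above) is to aggregate the group sizes separately and join them back:

from pyspark.sql import functions as F

counts = df.groupBy("ID", "Month").count()            # one row per (ID, Month) with its size
out = (df.join(counts.filter(F.col("count") > 1),     # keep only groups larger than one row
               on=["ID", "Month"])
         .drop("count"))
out.show()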