Tags: python, dataframe, pyspark, drop

Remove rows from PySpark dataframe where value only appears once


There is a very similar question to this one for pandas.
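For reference, the pandas idiom for that kind of question is a per-group count used as a boolean mask, presumably along these lines (a sketch; df and the ID column are taken from the data below):

    # pandas: keep only rows whose ID occurs more than once
    df2 = df[df.groupby("ID").ID.transform("count") > 1]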

My DataFrame looks like the following:

ID    Month   Metric1    Metric2
1       01       4          3
3       01       4          2
3       01       3          1      
1       02       3          2
1       02       1          5
2       02       6          3
2       02       7          2
3       03       5          6

I want to first group by Month and then by ID (the grouping is only for deciding what to drop; the result should not be grouped). Then I would like to drop all rows whose ID/Month combination appears only once.

ID    Month   Metric1    Metric2
-       --       -          -
3       01       4          2
3       01       3          1      
1       02       3          2
1       02       1          5
2       02       6          3
2       02       7          2
-       --       -          -

So it is the same problem as the question I linked above, only in PySpark and with two grouping columns.

I have been trying to convert the pandas command into PySpark, but I don't know how to replace the [], as PySpark doesn't seem to support that syntax:

df2 = df[df.groupBy("ID", "Month").ID.count() > 1]
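As far as I can tell, PySpark has no pandas-style boolean-mask indexing; the closest equivalent is .filter() / .where() with a Column expression. A sketch with an arbitrary condition:

    from pyspark.sql import functions as F

    # closest PySpark analogue of pandas df[mask]: a Column expression in filter()
    df.filter(F.col("Metric1") > 3)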

Also, I tried window functions. But they always seem to need an aggregation function, and I don't want the data to be aggregated:

from pyspark.sql import functions as F
from pyspark.sql.window import Window as W

window = W.partitionBy("Month").orderBy("ID")
df.withColumn("ID", F.count("ID").over(window)) \
    .show()

Any help would be appreciated!


Solution

  • Get the counts over a window partitioned by ID and Month, and eliminate the rows having a count of 1:

    from pyspark.sql import functions as F, Window as W

    # count rows per (ID, Month); without an orderBy the count covers the whole partition
    w = W.partitionBy("ID", "Month")
    out = df.withColumn("C", F.count("ID").over(w)).filter("C != 1").drop("C")
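
A minimal end-to-end sketch to check this against the sample data above (assuming a local SparkSession named spark; column names as in the question):

    from pyspark.sql import SparkSession, functions as F, Window as W

    spark = SparkSession.builder.getOrCreate()

    data = [(1, "01", 4, 3), (3, "01", 4, 2), (3, "01", 3, 1),
            (1, "02", 3, 2), (1, "02", 1, 5), (2, "02", 6, 3),
            (2, "02", 7, 2), (3, "03", 5, 6)]
    df = spark.createDataFrame(data, ["ID", "Month", "Metric1", "Metric2"])

    w = W.partitionBy("ID", "Month")
    df.withColumn("C", F.count("ID").over(w)).filter("C != 1").drop("C").show()
    # rows (ID=1, Month=01) and (ID=3, Month=03) are dropped, matching the expected output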