Tags: pyspark, filter, group-by

How to filter by max date inside groupBy in PySpark


I have a DataFrame in PySpark with a numeric field "value", a date field "date", and a field "date_year" holding the year of the date. Within each "date_year" I would like to keep only one row, the one with the max date.

Here is the snippet:

from pyspark.sql import functions as f

df_orders = df_orders \
    .groupBy("company_id", "date_year") \
    .filter(f.col("date") == f.max(f.col("date")))

So each company should end up with only one row per year: the one with its max date.

Unfortunately, this code does not work (groupBy returns a GroupedData object, which has no filter method).

Thanks!


Solution

  • Rank the dates in descending order per company_id and date_year, then filter the rows where the rank is 1:

    from pyspark.sql import functions as F
    from pyspark.sql.window import Window
    W = Window.partitionBy('company_id', 'date_year').orderBy(F.col('date').desc())
    result = df.withColumn('rank', F.dense_rank().over(W)).filter('rank = 1')
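
  • A minimal end-to-end sketch of the same approach, using made-up sample data (the column names match the question; the values are hypothetical). Dates are kept as ISO-formatted strings here, which sort correctly as text; cast to DateType if your dates are in another format.

    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.window import Window

    spark = SparkSession.builder.getOrCreate()

    # Hypothetical orders: two companies, several dates per year
    df = spark.createDataFrame(
        [
            (1, "2021", "2021-03-01", 10.0),
            (1, "2021", "2021-11-15", 20.0),
            (1, "2022", "2022-06-30", 30.0),
            (2, "2021", "2021-07-04", 40.0),
            (2, "2021", "2021-12-31", 50.0),
        ],
        ["company_id", "date_year", "date", "value"],
    )

    # Rank rows within each (company_id, date_year) by date, newest first,
    # then keep only the top-ranked row(s) and drop the helper column
    W = Window.partitionBy("company_id", "date_year").orderBy(F.col("date").desc())
    result = (
        df.withColumn("rank", F.dense_rank().over(W))
          .filter("rank = 1")
          .drop("rank")
    )
    result.show()
    # One row per (company_id, date_year), holding that group's latest date

  • Note that dense_rank keeps every row tied on the max date; if you want exactly one row per group even when dates tie, use F.row_number() instead.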