I have a DataFrame in PySpark with a numeric field "value", a date field "date", and a year-of-date field "date_year". Within each "date_year" I would like to keep only the row with the max date.
Here is the snippet:
from pyspark.sql import functions as f
df_orders = df_orders \
    .groupBy("company_id", "date_year") \
    .filter(f.col("date") == f.max(f.col("date")))
So each company should end up with only one row per year, the one holding the max date.
Unfortunately, this code does not work.
Thanks!
Rank the dates in descending order per company_id and date_year, then filter the rows where rank is 1.
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Latest date first within each (company_id, date_year) partition
W = Window.partitionBy('company_id', 'date_year').orderBy(F.col('date').desc())
result = df_orders.withColumn('rank', F.dense_rank().over(W)).filter('rank = 1')
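For context, here is a minimal, self-contained sketch of the same approach on a small hypothetical df_orders built in-line (the sample rows are made up for illustration); the final drop('rank') just removes the helper column:

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

# Hypothetical sample data: two companies, a couple of years each
df_orders = spark.createDataFrame(
    [(1, '2020-03-01', 2020), (1, '2020-11-20', 2020),
     (1, '2021-05-05', 2021), (2, '2020-07-15', 2020)],
    ['company_id', 'date', 'date_year'],
)

# Rank dates within each (company_id, date_year), latest first,
# keep only rank 1, then drop the helper column
W = Window.partitionBy('company_id', 'date_year').orderBy(F.col('date').desc())
result = (
    df_orders
    .withColumn('rank', F.dense_rank().over(W))
    .filter(F.col('rank') == 1)
    .drop('rank')
)
result.show()

Note that dense_rank keeps every row tied for the max date; if you need exactly one row per group even when dates tie, use F.row_number() instead.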