I have a dataframe df which contains daily data for many ids. Sample:
| yyyy_mm_dd | id | availability |
|------------|------|--------------|
| 2020-01-01 | 1334 | 300 |
| 2020-01-02 | 1334 | 676 |
| 2020-01-03 | 1334 | 463 |
| ... | 1334 | ... |
| 2020-03-29 | 1334 | 564 |
| 2020-03-30 | 1334 | 765 |
| 2020-03-31 | 1334 | 7564 |
| ... | 1334 | ... |
| 2020-06-28 | 1334 | 4634 |
| 2020-06-29 | 1334 | 65 |
| 2020-06-30 | 1334 | 643 |
| ... | 1334 | ... |
How could I identify the maximum availability in the last N days prior to the end of each quarter? I.e. if N = 3, I would like to find the maximum availability per id over the last 3 days before each quarter end.
I know I could hardcode the dates but I would like to avoid that if possible. Is this something that windowing could help with?
Expected output based on above input:
| id | year | quarter | max_availability |
|------|------|---------|------------------|
| 1334 | 2020 | 01 | 7564 |
| 1334 | 2020 | 02 | 4634 |
You can try the code below, which takes the last N rows (here N = 3, to match your example) before the end of each quarter and calculates the maximum availability grouped by id:
from pyspark.sql import functions as F, Window

N = 3  # number of days before each quarter end to consider

df2 = (
    df.withColumn(
        'rn',
        # number the rows within each (id, quarter, year) partition,
        # newest date first, so rn = 1 is the last day of the quarter
        F.row_number().over(
            Window.partitionBy(
                'id',
                F.quarter('yyyy_mm_dd'),
                F.year('yyyy_mm_dd')
            ).orderBy(F.desc('yyyy_mm_dd'))
        )
    )
    .filter(f'rn <= {N}')  # keep only the last N rows of each quarter
    .groupBy(
        'id',
        F.quarter('yyyy_mm_dd').alias('quarter'),
        F.year('yyyy_mm_dd').alias('year')
    )
    .agg(F.max('availability').alias('max_availability'))
)
df2.show()
+----+-------+----+----------------+
| id|quarter|year|max_availability|
+----+-------+----+----------------+
|1334| 1|2020| 7564|
|1334| 2|2020| 4634|
+----+-------+----+----------------+
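
One caveat: row_number counts rows, not calendar days, so this only matches "last N days" if every id has a row for every day of the quarter. If dates can be missing, filtering on the actual quarter-end date is safer. A minimal sketch of that variant, assuming Spark 3.x (where trunc supports 'quarter') and that yyyy_mm_dd parses with to_date:

from pyspark.sql import functions as F

N = 3  # calendar days before each quarter end to consider

df3 = (
    df.withColumn('date', F.to_date('yyyy_mm_dd'))
    # last calendar day of the quarter each row falls in:
    # first day of the quarter + 2 months, then last_day of that month
    .withColumn('q_end', F.last_day(F.add_months(F.trunc('date', 'quarter'), 2)))
    # keep rows within N days of the quarter end (datediff of 0..N-1)
    .filter(F.datediff('q_end', 'date') < N)
    .groupBy(
        'id',
        F.quarter('date').alias('quarter'),
        F.year('date').alias('year')
    )
    .agg(F.max('availability').alias('max_availability'))
)
df3.show()

On the sample data both versions return the same result, since a row exists for every day; they only diverge when days are missing near a quarter end.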