I am trying to get a window function to go back and fetch a previous row by a specific date, but it keeps giving me the immediately previous row instead of the row for the specified date. To set this up, I take the current row's date and find the Monday of that week, like so:
from pyspark.sql.functions import date_sub, next_day

def previous_day(date, dayOfWeek):
    # next_day returns the first date strictly after `date` that falls on
    # dayOfWeek, so stepping back 7 days lands on the current week's occurrence
    return date_sub(next_day(date, dayOfWeek), 7)

spark_df = spark_df.withColumn("last_monday", previous_day(spark_df['calendarday'], "monday"))
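As a sanity check outside Spark, the same "previous Monday" logic can be sketched in plain Python with datetime (an illustration only, not part of the Spark job):

```python
from datetime import date, timedelta

def previous_monday(d: date) -> date:
    """Return the Monday on or before d (a Monday maps to itself)."""
    # date.weekday() is 0 for Monday, so subtracting it
    # lands on that week's Monday
    return d - timedelta(days=d.weekday())

print(previous_monday(date(2015, 1, 7)))   # a Wednesday -> 2015-01-05
print(previous_monday(date(2015, 1, 5)))   # already Monday -> 2015-01-05
```

This matches the Spark expression: next_day(monday) minus 7 also maps a Monday back to itself.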
I then calculate the difference in days between the current day and its closest previous Monday:
from pyspark.sql import functions as F

d = F.datediff(spark_df['calendarday'], spark_df['last_monday'])
spark_df = spark_df.withColumn("daysSinceMonday", d)
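For these dates, that difference is just the weekday index, which can be verified in plain Python (again only an illustration, assuming the previous-Monday logic above):

```python
from datetime import date, timedelta

def days_since_monday(d: date) -> int:
    # Equivalent of datediff(calendarday, last_monday):
    # weekday() is already 0 for Monday through 6 for Sunday
    last_monday = d - timedelta(days=d.weekday())
    return (d - last_monday).days

print(days_since_monday(date(2015, 1, 8)))   # Thursday -> 3
print(days_since_monday(date(2015, 1, 12)))  # Monday   -> 0
```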
I can see that the daysSinceMonday value is correct for each row. Next I want to create a window and take the first row in it, with the frame ranged by the d value I set up, but for some reason it doesn't work:
days = lambda i: i * 86400
w = (Window.partitionBy(column_list)
     .orderBy(col('calendarday').cast("timestamp").cast("long"))
     .rangeBetween(-days(d), 0))
spark_df = spark_df.withColumn('PreviousYearUnique', first("indexCP").over(w))
Starting data frame:
## +---+-----------+-----------+--------+
## | id|calendarday|last_monday| indexCP|
## +---+-----------+-----------+--------+
## | 1|2015-01-05 | 2015-01-05| 0.0076|
## | 1|2015-01-06 | 2015-01-05| 0.0026|
## | 1|2015-01-07 | 2015-01-05| 0.0016|
## | 1|2015-01-08 | 2015-01-05| 0.0006|
## | 2|2015-01-09 | 2015-01-05| 0.0012|
## | 2|2015-01-10 | 2015-01-05| 0.0014|
## | 1|2015-01-12 | 2015-01-12| 0.0026|
## | 1|2015-01-13 | 2015-01-12| 0.0086|
## | 1|2015-01-14 | 2015-01-12| 0.0046|
## | 1|2015-01-15 | 2015-01-12| 0.0021|
## | 2|2015-01-16 | 2015-01-12| 0.0042|
## | 2|2015-01-17 | 2015-01-12| 0.0099|
## +---+-----------+-----------+--------+
Desired data frame, where each row gets the indexCP of its last_monday row as PreviousYearUnique:
## +---+-----------+-----------+--------+--------------------+
## | id|calendarday|last_monday| indexCP| PreviousYearUnique |
## +---+-----------+-----------+--------+--------------------+
## | 1|2015-01-05 | 2015-01-05| 0.0076| 0.0076|
## | 1|2015-01-06 | 2015-01-05| 0.0026| 0.0076|
## | 1|2015-01-07 | 2015-01-05| 0.0016| 0.0076|
## | 1|2015-01-08 | 2015-01-05| 0.0006| 0.0076|
## | 2|2015-01-09 | 2015-01-05| 0.0012| 0.0076|
## | 2|2015-01-10 | 2015-01-05| 0.0014| 0.0076|
## | 1|2015-01-12 | 2015-01-12| 0.0026| 0.0026|
## | 1|2015-01-13 | 2015-01-12| 0.0086| 0.0026|
## | 1|2015-01-14 | 2015-01-12| 0.0046| 0.0026|
## | 1|2015-01-15 | 2015-01-12| 0.0021| 0.0026|
## | 2|2015-01-16 | 2015-01-12| 0.0042| 0.0026|
## | 2|2015-01-17 | 2015-01-12| 0.0099| 0.0026|
## +---+-----------+-----------+--------+--------------------+
Any ideas what is going wrong?
You could partitionBy last_monday, order by calendarday over a window running from unboundedPreceding to the current row, and then take first of indexCP:
from pyspark.sql import functions as F
from pyspark.sql.window import Window
w = (Window.partitionBy("last_monday")
     .orderBy(F.to_date("calendarday", "yyyy-MM-dd"))
     .rowsBetween(Window.unboundedPreceding, Window.currentRow))

spark_df.withColumn("PreviousYearUnique", F.first("indexCP").over(w)).show()
#+---+-----------+-----------+-------+------------------+
#| id|calendarday|last_monday|indexCP|PreviousYearUnique|
#+---+-----------+-----------+-------+------------------+
#| 1| 2015-01-05| 2015-01-05| 0.0076| 0.0076|
#| 1| 2015-01-06| 2015-01-05| 0.0026| 0.0076|
#| 1| 2015-01-07| 2015-01-05| 0.0016| 0.0076|
#| 1| 2015-01-08| 2015-01-05| 6.0E-4| 0.0076|
#| 2| 2015-01-09| 2015-01-05| 0.0012| 0.0076|
#| 2| 2015-01-10| 2015-01-05| 0.0014| 0.0076|
#| 1| 2015-01-12| 2015-01-12| 0.0026| 0.0026|
#| 1| 2015-01-13| 2015-01-12| 0.0086| 0.0026|
#| 1| 2015-01-14| 2015-01-12| 0.0046| 0.0026|
#| 1| 2015-01-15| 2015-01-12| 0.0021| 0.0026|
#| 2| 2015-01-16| 2015-01-12| 0.0042| 0.0026|
#| 2| 2015-01-17| 2015-01-12| 0.0099| 0.0026|
#+---+-----------+-----------+-------+------------------+
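The effect of this window (first indexCP within each last_monday partition, ordered by date) can also be sketched in plain Python, which shows why no range frame over d is needed at all (the row data here is a hypothetical subset of the question's table):

```python
from itertools import groupby
from operator import itemgetter

rows = [
    {"calendarday": "2015-01-06", "last_monday": "2015-01-05", "indexCP": 0.0026},
    {"calendarday": "2015-01-05", "last_monday": "2015-01-05", "indexCP": 0.0076},
    {"calendarday": "2015-01-13", "last_monday": "2015-01-12", "indexCP": 0.0086},
    {"calendarday": "2015-01-12", "last_monday": "2015-01-12", "indexCP": 0.0026},
]

# Sort by (partition key, order key); ISO date strings sort chronologically.
rows.sort(key=itemgetter("last_monday", "calendarday"))

# The first row of each last_monday group is the Monday row itself,
# so its indexCP is the PreviousYearUnique value for the whole group.
for monday, grp in groupby(rows, key=itemgetter("last_monday")):
    print(monday, next(grp)["indexCP"])
# 2015-01-05 0.0076
# 2015-01-12 0.0026
```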