Tags: pyspark, window-functions, lag

Pyspark Lag function returning null


I have a dataframe that looks like this

>>> df.show()
+----------------------+------------------------+--------------------+
|date_cast             |id                      |         status    |
+----------------------+------------------------+--------------------+
|            2021-02-20|    123...              |open                |
|            2021-02-21|    123...              |open                |
|            2021-02-17|    123...              |closed              |
|            2021-02-22|    123...              |open                |
|            2021-02-19|    123...              |open                |
|            2021-02-18|    123...              |closed              |
+----------------------+------------------------+--------------------+

I have been trying to apply a very simple lag to it to see what its previous day's status was, but I keep getting null. The date was originally a string, so I cast it to a date, thinking the issue might be that strings don't order correctly as dates in the results. I also hard-coded the window in my partitionBy and still get null.

df_lag = df.withColumn('lag_status',F.lag(df['status']) \
                                 .over(Window.partitionBy("date_cast").orderBy(F.asc('date_cast')))).show()

Can someone help spot the issue in the attempt below?

>>> column_list = ["date_cast","id"]
>>> win_spec = Window.partitionBy([F.col(x) for x in column_list]).orderBy(F.asc('date_cast'))
>>> df.withColumn('lag_status', F.lag('status').over(
...     win_spec
...     )
... )

+----------------------+------------------------+--------------------+-----------+
|date_cast             |id                      |status              |lag_status|
+----------------------+------------------------+--------------------+-----------+
|            2021-02-19|    123...              |open                |       null|
|            2021-02-21|    123...              |open                |       null|
|            2021-02-17|    123...              |open                |       null|
|            2021-02-18|    123...              |open                |       null|
|            2021-02-22|    123...              |open                |       null|
|            2021-02-20|    123...              |open                |       null|
+----------------------+------------------------+--------------------+-----------+

Solution

  • This happens because you partitioned the data by date_cast, and date_cast has unique values: every partition contains a single row, so lag() has no previous row to return and yields null. Partition by "id" instead of date_cast, for example:

    df_lag = df.withColumn('lag_status',F.lag(df['status']) \
                                     .over(Window.partitionBy("id").orderBy(F.asc('date_cast')))).show()