I have tried the code from (this_post) but cannot get the date difference in seconds. I simply take datediff() between the columns 'Attributes_Timestamp_fix' and 'lagged_date' below. Any hints? My code and output are below.
from pyspark.sql.functions import lag, datediff
from pyspark.sql.window import Window

eg = eg.withColumn("lagged_date", lag(eg.Attributes_Timestamp_fix, 1)
                   .over(Window.partitionBy("id")
                   .orderBy("Attributes_Timestamp_fix")))
eg = eg.withColumn("time_diff",
                   datediff(eg.Attributes_Timestamp_fix, eg.lagged_date))
             id Attributes_Timestamp_fix  time_diff
0  3.531611e+14      2018-04-01 00:01:02        NaN
1  3.531611e+14      2018-04-01 00:01:02        0.0
2  3.531611e+14      2018-04-01 00:03:13        0.0
3  3.531611e+14      2018-04-01 00:03:13        0.0
4  3.531611e+14      2018-04-01 00:03:13        0.0
5  3.531611e+14      2018-04-01 00:03:13        0.0
In pyspark.sql.functions there is a datediff function, which unfortunately only computes differences in days. To overcome this, you can convert both dates to Unix timestamps (in seconds) and compute the difference.
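For instance, here is a minimal check of that day-level granularity (the timestamp values are just illustrative):

from pyspark.sql.functions import datediff, lit, to_timestamp

# datediff() casts both arguments to dates, so two timestamps on the
# same calendar day yield 0 even when they are almost 24 hours apart
spark.range(1).select(
    datediff(to_timestamp(lit('2018-01-01 23:59:59')),
             to_timestamp(lit('2018-01-01 00:00:01'))).alias('days')
).show()  # days = 0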
Let's create some sample data, compute the lag and then the difference in seconds.
from pyspark.sql.functions import col, lag, unix_timestamp
from pyspark.sql.window import Window
import datetime

d = [{'id': 1, 't': datetime.datetime(2018, 1, 1)},
     {'id': 1, 't': datetime.datetime(2018, 1, 2)},
     {'id': 1, 't': datetime.datetime(2018, 1, 4)},
     {'id': 1, 't': datetime.datetime(2018, 1, 7)}]
df = spark.createDataFrame(d)
df.show()
+---+-------------------+
| id| t|
+---+-------------------+
| 1|2018-01-01 00:00:00|
| 1|2018-01-02 00:00:00|
| 1|2018-01-04 00:00:00|
| 1|2018-01-07 00:00:00|
+---+-------------------+
# lag t by one row within each id, then take the difference of the
# Unix timestamps (in seconds)
w = Window.partitionBy('id').orderBy('t')
df.withColumn("previous_t", lag(df.t, 1).over(w))\
  .select(df.t, (unix_timestamp(df.t) - unix_timestamp(col('previous_t'))).alias('diff'))\
  .show()
+-------------------+------+
| t| diff|
+-------------------+------+
|2018-01-01 00:00:00| null|
|2018-01-02 00:00:00| 86400|
|2018-01-04 00:00:00|172800|
|2018-01-07 00:00:00|259200|
+-------------------+------+
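Applied to your columns, the same pattern would look like this (a sketch assuming eg and lagged_date are built as in your question):

from pyspark.sql.functions import col, unix_timestamp

# replace the datediff() call with a difference of Unix timestamps,
# which gives the gap in seconds rather than whole days
eg = eg.withColumn('time_diff',
                   unix_timestamp(col('Attributes_Timestamp_fix'))
                   - unix_timestamp(col('lagged_date')))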