I am trying to compute the time difference "time_d" in seconds between consecutive timestamps within each "name" group in PySpark.
+-------------------+----+
| timestamplast|name|
+-------------------+----+
|2019-08-01 00:00:00| 1|
|2019-08-01 00:01:00| 1|
|2019-08-01 00:01:15| 1|
|2019-08-01 03:00:00| 2|
|2019-08-01 04:00:00| 2|
|2019-08-01 00:15:00| 3|
+-------------------+----+
Output should look like:
+-------------------+----+--------+
| timestamplast|name| time_d |
+-------------------+----+--------+
|2019-08-01 00:00:00| 1| 0 |
|2019-08-01 00:01:00| 1| 60 |
|2019-08-01 00:01:15| 1| 15 |
|2019-08-01 03:00:00| 2| 0 |
|2019-08-01 04:00:00| 2| 3600 |
|2019-08-01 00:15:00| 3| 0 |
+-------------------+----+--------+
In pandas this would be:
df['time_d'] = df.groupby("name")['timestamplast'].diff().fillna(pd.Timedelta(0)).dt.total_seconds()
How would this be done in PySpark?
You can use a lag window function (partitioned by name) and then compute the difference on the timestamp converted to seconds (unix_timestamp). lag returns null for the first row of each partition, so that case is mapped to 0.
from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window.partitionBy("name").orderBy(F.col("timestamplast"))

df.withColumn("time_d", F.lag(F.unix_timestamp("timestamplast")).over(w))\
  .withColumn("time_d", F.when(F.col("time_d").isNotNull(),
                               F.unix_timestamp("timestamplast") - F.col("time_d"))
                         .otherwise(F.lit(0)))\
  .orderBy("name", "timestamplast").show()
#+-------------------+----+------+
#| timestamplast|name|time_d|
#+-------------------+----+------+
#|2019-08-01 00:00:00| 1| 0|
#|2019-08-01 00:01:00| 1| 60|
#|2019-08-01 00:01:15| 1| 15|
#|2019-08-01 03:00:00| 2| 0|
#|2019-08-01 04:00:00| 2| 3600|
#|2019-08-01 00:15:00| 3| 0|
#+-------------------+----+------+