
PySpark: get time difference between consecutive timestamps within a group


I'm trying to compute the time difference "time_d" in seconds between consecutive timestamps within each "name" group in PySpark.

+-------------------+----+
|      timestamplast|name|
+-------------------+----+
|2019-08-01 00:00:00|   1|
|2019-08-01 00:01:00|   1|
|2019-08-01 00:01:15|   1|
|2019-08-01 03:00:00|   2|
|2019-08-01 04:00:00|   2|
|2019-08-01 00:15:00|   3|
+-------------------+----+

Output should look like:

+-------------------+----+------+
|      timestamplast|name|time_d|
+-------------------+----+------+
|2019-08-01 00:00:00|   1|     0|
|2019-08-01 00:01:00|   1|    60|
|2019-08-01 00:01:15|   1|    15|
|2019-08-01 03:00:00|   2|     0|
|2019-08-01 04:00:00|   2|  3600|
|2019-08-01 00:15:00|   3|     0|
+-------------------+----+------+

In Pandas this would be:

df['time_d'] = df.groupby("name")['timestamplast'].diff().fillna(pd.Timedelta(0)).dt.total_seconds()

How would this be done in PySpark?


Solution

  • You can use a lag window function (partitioned by name) and then compute the difference from the previous row's timestamp in seconds (unix_timestamp), as in the snippet below.

    from pyspark.sql import functions as F
    from pyspark.sql.window import Window

    # Window over each name group, ordered by timestamp
    w = Window.partitionBy("name").orderBy(F.col("timestamplast"))

    # First pass: previous row's timestamp (seconds since the epoch) within the group;
    # this is null for the first row of each group.
    # Second pass: subtract it from the current row's timestamp, replacing null with 0.
    df.withColumn("time_d", F.lag(F.unix_timestamp("timestamplast")).over(w))\
      .withColumn("time_d", F.when(F.col("time_d").isNotNull(),
                                   F.unix_timestamp("timestamplast") - F.col("time_d"))
                             .otherwise(F.lit(0)))\
      .orderBy("name", "timestamplast").show()
    
    #+-------------------+----+------+
    #|      timestamplast|name|time_d|
    #+-------------------+----+------+
    #|2019-08-01 00:00:00|   1|     0|
    #|2019-08-01 00:01:00|   1|    60|
    #|2019-08-01 00:01:15|   1|    15|
    #|2019-08-01 03:00:00|   2|     0|
    #|2019-08-01 04:00:00|   2|  3600|
    #|2019-08-01 00:15:00|   3|     0|
    #+-------------------+----+------+
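
  • For completeness, here is a minimal, self-contained sketch of the same approach that also builds the sample data, assuming a local SparkSession (the session setup is illustrative, not part of the original answer). The only change to the logic is using coalesce instead of the when/otherwise branch to zero out the first row of each group.

    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.window import Window

    # Hypothetical local session for testing; adjust to your environment.
    spark = SparkSession.builder.master("local[*]").getOrCreate()

    # Recreate the sample input and cast the string column to a real timestamp.
    df = spark.createDataFrame(
        [
            ("2019-08-01 00:00:00", 1),
            ("2019-08-01 00:01:00", 1),
            ("2019-08-01 00:01:15", 1),
            ("2019-08-01 03:00:00", 2),
            ("2019-08-01 04:00:00", 2),
            ("2019-08-01 00:15:00", 3),
        ],
        ["timestamplast", "name"],
    ).withColumn("timestamplast", F.to_timestamp("timestamplast"))

    w = Window.partitionBy("name").orderBy("timestamplast")

    # coalesce falls back to 0 where lag() is null (the first row of each group).
    df = df.withColumn(
        "time_d",
        F.coalesce(
            F.unix_timestamp("timestamplast")
            - F.lag(F.unix_timestamp("timestamplast")).over(w),
            F.lit(0),
        ),
    )
    df.orderBy("name", "timestamplast").show()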