pyspark, apache-spark-sql

(PySpark) How to Divide Time Intervals into Time Periods


I have a DataFrame created with Spark SQL, with IDs and corresponding checkin_datetime and checkout_datetime columns, as shown in the input DataFrame image.

I would like to divide each of these time intervals into one-hour periods, as shown in the expected result image.

Code to create the Spark DataFrame:

import pandas as pd

data = {
    'ID': [4, 4, 4, 4, 22, 22, 25, 29],
    'checkin_datetime': ['04-01-2019 13:07', '04-01-2019 13:09', '04-01-2019 14:06', '04-01-2019 14:55',
                         '04-01-2019 20:23', '04-01-2019 21:38', '04-01-2019 23:22', '04-02-2019 01:00'],
    'checkout_datetime': ['04-01-2019 13:09', '04-01-2019 13:12', '04-01-2019 14:07', '04-01-2019 15:06',
                          '04-01-2019 21:32', '04-01-2019 21:42', '04-02-2019 00:23', '04-02-2019 06:15']
}
df = pd.DataFrame(data, columns=['ID', 'checkin_datetime', 'checkout_datetime'])
df1 = spark.createDataFrame(df)

Solution

  • To compute the hourly intervals:

    1. First, explode the hour boundaries between checkin_datetime and checkout_datetime. We do this by computing the number of hours spanned from the check-in hour to the checkout time and iterating over that range to generate one row per hour boundary (see the short illustration after this list).
    2. Once the intervals are exploded into a next_hour column, we can use it to measure the gap between checkin_datetime and next_hour, or between checkout_datetime and next_hour.
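    As a minimal, hypothetical illustration of step 1 (assuming an active spark session and Spark 3.1+, since F.transform takes a Python lambda), here is how the exploded hour boundaries look for a single interval taken from the sample data:

    from pyspark.sql import functions as F

    # One illustrative row: check-in 20:23, check-out 21:32 (spans two hour buckets).
    demo = (spark.createDataFrame(
                [("2019-04-01 20:23:00", "2019-04-01 21:32:00")],
                ["checkin_datetime", "checkout_datetime"])
            .withColumn("checkin_datetime", F.to_timestamp("checkin_datetime"))
            .withColumn("checkout_datetime", F.to_timestamp("checkout_datetime")))

    start_hour = F.date_trunc("hour", "checkin_datetime")
    # Whole hours from the truncated check-in (20:00) to the check-out (21:32): floor(1.53) = 1.
    diff_hours = F.floor((F.unix_timestamp("checkout_datetime") - F.unix_timestamp(start_hour)) / 3600)

    # Offsets 0..diff_hours, each shifted (offset + 1) hours past the truncated check-in, then exploded.
    demo.withColumn(
        "next_hour",
        F.explode(F.transform(F.sequence(F.lit(0), diff_hours),
                              lambda x: F.to_timestamp(F.unix_timestamp(start_hour) + (x + 1) * 3600))),
    ).show(truncate=False)
    # next_hour -> 2019-04-01 21:00:00 and 2019-04-01 22:00:00

    The full solution applies the same idea to the whole DataFrame: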
    from pyspark.sql import functions as F
    import pandas as pd

    data = {
        'ID': [4, 4, 4, 4, 22, 22, 25, 29],
        'checkin_datetime': ['04-01-2019 13:07', '04-01-2019 13:09', '04-01-2019 14:06', '04-01-2019 14:55',
                             '04-01-2019 20:23', '04-01-2019 21:38', '04-01-2019 23:22', '04-02-2019 01:00'],
        'checkout_datetime': ['04-01-2019 13:09', '04-01-2019 13:12', '04-01-2019 14:07', '04-01-2019 15:06',
                              '04-01-2019 21:32', '04-01-2019 21:42', '04-02-2019 00:23', '04-02-2019 06:15']
    }
    df = pd.DataFrame(data, columns=['ID', 'checkin_datetime', 'checkout_datetime'])

    # Parse the string columns into proper timestamps so the interval arithmetic below works.
    df1 = (spark.createDataFrame(df)
           .withColumn("checkin_datetime", F.to_timestamp("checkin_datetime", "MM-dd-yyyy HH:mm"))
           .withColumn("checkout_datetime", F.to_timestamp("checkout_datetime", "MM-dd-yyyy HH:mm")))
    
    # Epoch seconds of the raw check-in / check-out times.
    unix_checkin = F.unix_timestamp("checkin_datetime")
    unix_checkout = F.unix_timestamp("checkout_datetime")

    # Check-in truncated to its hour, and the first hour boundary after the check-out.
    start_hour_checkin = F.date_trunc("hour", "checkin_datetime")
    unix_start_hour_checkin = F.unix_timestamp(start_hour_checkin)
    checkout_next_hour = F.date_trunc("hour", "checkout_datetime") + F.expr("INTERVAL 1 HOUR")

    # Number of whole hours between the truncated check-in and the check-out.
    diff_hours = F.floor((unix_checkout - unix_start_hour_checkin) / 3600)

    # One row per hour boundary after the check-in: shift each offset (0..diff_hours)
    # by (offset + 1) hours past the truncated check-in, then explode.
    next_hour = F.explode(F.transform(F.sequence(F.lit(0), diff_hours),
                                      lambda x: F.to_timestamp(F.unix_timestamp(start_hour_checkin) + (x + 1) * 3600)))
    
    # Minutes falling inside each hour bucket: same-hour stays use checkout - checkin;
    # the bucket containing the checkout uses checkout minus the start of that hour;
    # every other bucket gets the time from checkin to next_hour, capped at 60.
    minute = (F.when(start_hour_checkin == F.date_trunc("hour", "checkout_datetime"), (unix_checkout - unix_checkin) / 60)
               .when(checkout_next_hour == F.col("next_hour"), (unix_checkout - F.unix_timestamp(F.date_trunc("hour", "checkout_datetime"))) / 60)
               .otherwise(F.least((F.unix_timestamp(F.col("next_hour")) - unix_checkin) / 60, F.lit(60)))
             ).cast("int")
    
    # Label each exploded row with the day and hour of its bucket (next_hour minus one hour).
    (df1.withColumn("next_hour", next_hour)
        .withColumn("minutes", minute)
        .withColumn("hr", F.date_format(F.expr("next_hour - INTERVAL 1 HOUR"), "H"))
        .withColumn("day", F.to_date(F.expr("next_hour - INTERVAL 1 HOUR")))
        .select("ID", "checkin_datetime", "checkout_datetime", "day", "hr", "minutes")
    ).show()
    """
    +---+-------------------+-------------------+----------+---+-------+
    | ID|   checkin_datetime|  checkout_datetime|       day| hr|minutes|
    +---+-------------------+-------------------+----------+---+-------+
    |  4|2019-04-01 13:07:00|2019-04-01 13:09:00|2019-04-01| 13|      2|
    |  4|2019-04-01 13:09:00|2019-04-01 13:12:00|2019-04-01| 13|      3|
    |  4|2019-04-01 14:06:00|2019-04-01 14:07:00|2019-04-01| 14|      1|
    |  4|2019-04-01 14:55:00|2019-04-01 15:06:00|2019-04-01| 14|      5|
    |  4|2019-04-01 14:55:00|2019-04-01 15:06:00|2019-04-01| 15|      6|
    | 22|2019-04-01 20:23:00|2019-04-01 21:32:00|2019-04-01| 20|     37|
    | 22|2019-04-01 20:23:00|2019-04-01 21:32:00|2019-04-01| 21|     32|
    | 22|2019-04-01 21:38:00|2019-04-01 21:42:00|2019-04-01| 21|      4|
    | 25|2019-04-01 23:22:00|2019-04-02 00:23:00|2019-04-01| 23|     38|
    | 25|2019-04-01 23:22:00|2019-04-02 00:23:00|2019-04-02|  0|     23|
    | 29|2019-04-02 01:00:00|2019-04-02 06:15:00|2019-04-02|  1|     60|
    | 29|2019-04-02 01:00:00|2019-04-02 06:15:00|2019-04-02|  2|     60|
    | 29|2019-04-02 01:00:00|2019-04-02 06:15:00|2019-04-02|  3|     60|
    | 29|2019-04-02 01:00:00|2019-04-02 06:15:00|2019-04-02|  4|     60|
    | 29|2019-04-02 01:00:00|2019-04-02 06:15:00|2019-04-02|  5|     60|
    | 29|2019-04-02 01:00:00|2019-04-02 06:15:00|2019-04-02|  6|     15|
    +---+-------------------+-------------------+----------+---+-------+
    """