I have a DataFrame created with Spark SQL, with IDs and corresponding checkin_datetime and checkout_datetime columns, as the picture shows.
I would like to divide each time interval into one-hour periods, as the picture shows.
Code to create the Spark DataFrame:
import pandas as pd

data = {'ID': [4, 4, 4, 4, 22, 22, 25, 29],
        'checkin_datetime': ['04-01-2019 13:07', '04-01-2019 13:09', '04-01-2019 14:06', '04-01-2019 14:55',
                             '04-01-2019 20:23', '04-01-2019 21:38', '04-01-2019 23:22', '04-02-2019 01:00'],
        'checkout_datetime': ['04-01-2019 13:09', '04-01-2019 13:12', '04-01-2019 14:07', '04-01-2019 15:06',
                              '04-01-2019 21:32', '04-01-2019 21:42', '04-02-2019 00:23', '04-02-2019 06:15']}

df = pd.DataFrame(data, columns=['ID', 'checkin_datetime', 'checkout_datetime'])
df1 = spark.createDataFrame(df)
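Since the pictures are not included, here is a plain-Python sketch of the desired split (the `split_into_hours` helper is hypothetical, for illustration only): each interval is cut at hour boundaries, yielding one (day, hour, minutes) row per hour it touches.

```python
from datetime import datetime, timedelta

def split_into_hours(checkin, checkout):
    """Split [checkin, checkout] into (day, hour, minutes) buckets."""
    rows = []
    # Truncate the check-in to the start of its hour, then walk hour by hour.
    cur = checkin.replace(minute=0, second=0, microsecond=0)
    while cur < checkout:
        nxt = cur + timedelta(hours=1)
        # Overlap of [checkin, checkout] with the bucket [cur, nxt).
        start, end = max(cur, checkin), min(nxt, checkout)
        minutes = int((end - start).total_seconds() // 60)
        rows.append((cur.date().isoformat(), cur.hour, minutes))
        cur = nxt
    return rows

# e.g. ID 22: 04-01-2019 20:23 -> 04-01-2019 21:32
print(split_into_hours(datetime(2019, 4, 1, 20, 23), datetime(2019, 4, 1, 21, 32)))
# [('2019-04-01', 20, 37), ('2019-04-01', 21, 32)]
```

This matches the two rows for ID 22 in the expected output below (37 minutes in hour 20, 32 minutes in hour 21).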
To compute the hourly intervals, we truncate checkin_datetime to the start of its hour, compute the number of hours spanned between checkin_datetime and checkout_datetime, and iterate over that range to generate one row per hour. Each generated row carries a next_hour timestamp, which we can compare against checkin_datetime and checkout_datetime to work out how many minutes of that hour fall inside the interval.
from pyspark.sql import functions as F
import pandas as pd

data = {'ID': [4, 4, 4, 4, 22, 22, 25, 29],
        'checkin_datetime': ['04-01-2019 13:07', '04-01-2019 13:09', '04-01-2019 14:06', '04-01-2019 14:55',
                             '04-01-2019 20:23', '04-01-2019 21:38', '04-01-2019 23:22', '04-02-2019 01:00'],
        'checkout_datetime': ['04-01-2019 13:09', '04-01-2019 13:12', '04-01-2019 14:07', '04-01-2019 15:06',
                              '04-01-2019 21:32', '04-01-2019 21:42', '04-02-2019 00:23', '04-02-2019 06:15']}

df = pd.DataFrame(data, columns=['ID', 'checkin_datetime', 'checkout_datetime'])
df1 = (spark.createDataFrame(df)
       .withColumn("checkin_datetime", F.to_timestamp("checkin_datetime", "MM-dd-yyyy HH:mm"))
       .withColumn("checkout_datetime", F.to_timestamp("checkout_datetime", "MM-dd-yyyy HH:mm")))
# Unix timestamps (seconds) for interval arithmetic.
unix_checkin = F.unix_timestamp("checkin_datetime")
unix_checkout = F.unix_timestamp("checkout_datetime")
# Start of the check-in hour (13:07 -> 13:00) and the hour boundary just after check-out.
start_hour_checkin = F.date_trunc("hour", "checkin_datetime")
unix_start_hour_checkin = F.unix_timestamp(start_hour_checkin)
checkout_next_hour = F.date_trunc("hour", "checkout_datetime") + F.expr("INTERVAL 1 HOUR")
# Number of hour buckets spanned; explode one row per bucket, with next_hour as its upper boundary.
diff_hours = F.floor((unix_checkout - unix_start_hour_checkin) / 3600)
next_hour = F.explode(F.transform(F.sequence(F.lit(0), diff_hours),
                                  lambda x: F.to_timestamp(F.unix_timestamp(start_hour_checkin) + (x + 1) * 3600)))
# Minutes of each bucket covered: same-hour stay, final partial bucket, or up to a full 60.
minute = (F.when(start_hour_checkin == F.date_trunc("hour", "checkout_datetime"), (unix_checkout - unix_checkin) / 60)
          .when(checkout_next_hour == F.col("next_hour"), (unix_checkout - F.unix_timestamp(F.date_trunc("hour", "checkout_datetime"))) / 60)
          .otherwise(F.least((F.unix_timestamp(F.col("next_hour")) - unix_checkin) / 60, F.lit(60)))
          ).cast("int")
(df1.withColumn("next_hour", next_hour)
.withColumn("minutes", minute)
.withColumn("hr", F.date_format(F.expr("next_hour - INTERVAL 1 HOUR"), "H"))
.withColumn("day", F.to_date(F.expr("next_hour - INTERVAL 1 HOUR")))
.select("ID", "checkin_datetime", "checkout_datetime", "day", "hr", "minutes")
).show()
"""
+---+-------------------+-------------------+----------+---+-------+
| ID| checkin_datetime| checkout_datetime| day| hr|minutes|
+---+-------------------+-------------------+----------+---+-------+
| 4|2019-04-01 13:07:00|2019-04-01 13:09:00|2019-04-01| 13| 2|
| 4|2019-04-01 13:09:00|2019-04-01 13:12:00|2019-04-01| 13| 3|
| 4|2019-04-01 14:06:00|2019-04-01 14:07:00|2019-04-01| 14| 1|
| 4|2019-04-01 14:55:00|2019-04-01 15:06:00|2019-04-01| 14| 5|
| 4|2019-04-01 14:55:00|2019-04-01 15:06:00|2019-04-01| 15| 6|
| 22|2019-04-01 20:23:00|2019-04-01 21:32:00|2019-04-01| 20| 37|
| 22|2019-04-01 20:23:00|2019-04-01 21:32:00|2019-04-01| 21| 32|
| 22|2019-04-01 21:38:00|2019-04-01 21:42:00|2019-04-01| 21| 4|
| 25|2019-04-01 23:22:00|2019-04-02 00:23:00|2019-04-01| 23| 38|
| 25|2019-04-01 23:22:00|2019-04-02 00:23:00|2019-04-02| 0| 23|
| 29|2019-04-02 01:00:00|2019-04-02 06:15:00|2019-04-02| 1| 60|
| 29|2019-04-02 01:00:00|2019-04-02 06:15:00|2019-04-02| 2| 60|
| 29|2019-04-02 01:00:00|2019-04-02 06:15:00|2019-04-02| 3| 60|
| 29|2019-04-02 01:00:00|2019-04-02 06:15:00|2019-04-02| 4| 60|
| 29|2019-04-02 01:00:00|2019-04-02 06:15:00|2019-04-02| 5| 60|
| 29|2019-04-02 01:00:00|2019-04-02 06:15:00|2019-04-02| 6| 15|
+---+-------------------+-------------------+----------+---+-------+
"""