I am working with a dataframe like this:
Id | TimeStamp | Event | DeviceId
1 | 5.2.2019 8:00:00 | connect | 1
2 | 5.2.2019 8:00:05 | disconnect| 1
I am using databricks and pyspark to do the ETL process. How can I calculate and create such a dataframe as shown at the bottom? I have already tried using a UDF but I could not find a way to make it work. I also tried to do it by iterating over the whole data frame, but this is extremely slow.
I want to aggregate this dataframe to get a new dataframe which tells me the times, how long each device has been connected and disconnected:
Id | StartDateTime | EndDateTime | EventDuration |State | DeviceId
1 | 5.2.19 8:00:00 | 5.2.19 8:00:05| 0.00:00:05 |connected| 1
I think you can make this work with a window
function and some further column creations with withColumn
.
The code I did should create the mapping for devices and create a table with the duration for each state. The only requirement is that connect and disconnect appear alternatively.
Then you can use the following code:
from pyspark.sql.types import *
from pyspark.sql.functions import *
from pyspark.sql.window import Window
import datetime
test_df = sqlContext.createDataFrame([(1,datetime.datetime(2019,2,5,8),"connect",1),
(2,datetime.datetime(2019,2,5,8,0,5),"disconnect",1),
(3,datetime.datetime(2019,2,5,8,10),"connect",1),
(4,datetime.datetime(2019,2,5,8,20),"disconnect",1),],
["Id","TimeStamp","Event","DeviceId"])
#creation of dataframe with 4 events for 1 device
test_df.show()
Output:
+---+-------------------+----------+--------+
| Id| TimeStamp| Event|DeviceId|
+---+-------------------+----------+--------+
| 1|2019-02-05 08:00:00| connect| 1|
| 2|2019-02-05 08:00:05|disconnect| 1|
| 3|2019-02-05 08:10:00| connect| 1|
| 4|2019-02-05 08:20:00|disconnect| 1|
+---+-------------------+----------+--------+
Then you can create the helper functions and the window:
my_window = Window.partitionBy("DeviceId").orderBy(col("TimeStamp").desc()) #create window
get_prev_time = lag(col("Timestamp"),1).over(my_window) #get previous timestamp
time_diff = get_prev_time.cast("long") - col("TimeStamp").cast("long") #compute duration
test_df.withColumn("EventDuration",time_diff)\
.withColumn("EndDateTime",get_prev_time)\ #apply the helper functions
.withColumnRenamed("TimeStamp","StartDateTime")\ #rename according to your schema
.withColumn("State",when(col("Event")=="connect", "connected").otherwise("disconnected"))\ #create the state column
.filter(col("EventDuration").isNotNull()).select("Id","StartDateTime","EndDateTime","EventDuration","State","DeviceId").show()
#finally some filtering for the last events, which do not have a previous time
Output:
+---+-------------------+-------------------+-------------+------------+--------+
| Id| StartDateTime| EndDateTime|EventDuration| State|DeviceId|
+---+-------------------+-------------------+-------------+------------+--------+
| 3|2019-02-05 08:10:00|2019-02-05 08:20:00| 600| connected| 1|
| 2|2019-02-05 08:00:05|2019-02-05 08:10:00| 595|disconnected| 1|
| 1|2019-02-05 08:00:00|2019-02-05 08:00:05| 5| connected| 1|
+---+-------------------+-------------------+-------------+------------+--------+