Can I easily convert a string column (containing a time duration) to seconds in PySpark?
Is there any function that does this directly? I would like to avoid multiplying each part of my string into seconds by hand.
Input
id | duration |
---|---|
1 | 00 00:00:34 |
2 | 00 00:04:37 |
3 | 120 00:04:37 |
... | ... |
NOTE:
Id 1 -> 0 days, 0 hours, 0 minutes, 34 seconds
Id 2 -> 0 days, 0 hours, 4 minutes, 37 seconds
Id 3 -> 120 days, 0 hours, 4 minutes, 37 seconds
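So, for example, id 3 should come out to 120*86400 + 0*3600 + 4*60 + 37 = 10368277 seconds.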
Output
id | duration |
---|---|
1 | 34 |
2 | 277 |
3 | ... |
... | ... |
You can get the day, hour, minute, and second parts from the duration column by applying split,
and then sum up the corresponding seconds to get the desired result.
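First, a minimal DataFrame to reproduce the sample input (the schema is assumed from the question: an integer id and a string duration):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, split

spark = SparkSession.builder.getOrCreate()

# Sample data from the question; duration is "<days> HH:MM:SS"
df = spark.createDataFrame(
    [(1, "00 00:00:34"), (2, "00 00:04:37"), (3, "120 00:04:37")],
    ["id", "duration"],
)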
df.withColumn("duration", split("duration", r"\s+")) \
    .withColumn("time", split(col("duration").getItem(1), ":")) \
    .select(col("id"),
            (col("duration").getItem(0).cast("int") * 86400 +  # days
             col("time").getItem(0).cast("int") * 3600 +       # hours
             col("time").getItem(1).cast("int") * 60 +         # minutes
             col("time").getItem(2).cast("int")                # seconds
            ).cast("long").alias("duration")
    ).show()
+---+--------+
| id|duration|
+---+--------+
| 1| 34|
| 2| 277|
| 3|10368277|
+---+--------+
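If you'd rather not create the intermediate arrays, a regexp_extract version works too. This is just a sketch, assuming every duration matches the "<days> HH:MM:SS" pattern; the arithmetic is identical:

from pyspark.sql.functions import regexp_extract

pattern = r"(\d+) (\d+):(\d+):(\d+)"
df.select(col("id"),
          (regexp_extract("duration", pattern, 1).cast("int") * 86400 +  # days
           regexp_extract("duration", pattern, 2).cast("int") * 3600 +   # hours
           regexp_extract("duration", pattern, 3).cast("int") * 60 +     # minutes
           regexp_extract("duration", pattern, 4).cast("int")            # seconds
          ).cast("long").alias("duration")
).show()

This produces the same output as above.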