Convert duration (string column) to seconds PySpark


Can I easily convert a string column (containing a time duration) to seconds in PySpark?

Is there a function that does this directly? I would like to avoid multiplying each part of the string to get seconds.

Input

id   duration
1    00 00:00:34
2    00 00:04:37
3    120 00:04:37
...  ...

NOTE:

Id 1 -> 0 days, 0 hours, 0 minutes, 34 seconds

Id 2 -> 0 days, 0 hours, 4 minutes, 37 seconds

Id 3 -> 120 days, 0 hours, 4 minutes, 37 seconds
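
So, for example, id 3 works out to 120*86400 + 0*3600 + 4*60 + 37 = 10368277 seconds.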

Output

id   duration
1    34
2    277
3    ...
...  ...


Solution

  • You can extract the days, hours, minutes, and seconds from the duration column by applying split, then sum the corresponding number of seconds to get the desired result (a self-contained, runnable sketch follows the output below).

    from pyspark.sql.functions import split, col

    df = ...  # input DataFrame with columns id, duration

    # Split "DD HH:MM:SS" into [days, "HH:MM:SS"], then split the time part on ":"
    df.withColumn("duration", split("duration", "\\s+")) \
        .withColumn("time", split(col("duration").getItem(1), ":")) \
        .select(col("id"),
                ((col("duration").getItem(0).cast("int") * 86400) +  # days
                 (col("time").getItem(0).cast("int") * 3600) +       # hours
                 (col("time").getItem(1).cast("int") * 60) +         # minutes
                 (col("time").getItem(2).cast("int"))                # seconds
                 ).cast("long").alias("duration")
                ).show()
    
    +---+--------+
    | id|duration|
    +---+--------+
    |  1|      34|
    |  2|     277|
    |  3|10368277|
    +---+--------+
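
  • For a quick end-to-end test, the sample input from the question can be rebuilt with createDataFrame. This is a minimal sketch, assuming a local SparkSession; regexp_extract is shown as an alternative to split, capturing all four parts in one pass:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import regexp_extract

    spark = SparkSession.builder.getOrCreate()

    # Sample input from the question
    df = spark.createDataFrame(
        [(1, "00 00:00:34"), (2, "00 00:04:37"), (3, "120 00:04:37")],
        ["id", "duration"],
    )

    # One regex with capture groups for days, hours, minutes, seconds
    p = r"^(\d+) (\d{2}):(\d{2}):(\d{2})$"
    df.select(
        "id",
        (regexp_extract("duration", p, 1).cast("long") * 86400
         + regexp_extract("duration", p, 2).cast("long") * 3600
         + regexp_extract("duration", p, 3).cast("long") * 60
         + regexp_extract("duration", p, 4).cast("long")).alias("duration"),
    ).show()

    This produces the same output table as the split-based version above.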