Tags: apache-spark, pyspark, iso8601

Apache Spark: parse PT2H5M (duration ISO-8601) duration in minutes


In ISO 8601, durations are written in the format PT5M (5 minutes) or PT2H5M (2 hours 5 minutes). I have a JSON file that contains values in this format, and I want to know whether Spark can extract the duration in minutes. I tried reading the field as "DateType" and using the "minute" function to get the minutes, but it returned null values.

Example JSON: {"name": "Fennel Mushrooms", "cookTime": "PT30M"}

Currently, I am reading the field as a string and parsing it with the "regexp_extract" function. I would like to know if there is a more efficient way.
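For reference, the regexp_extract approach described above boils down to the following logic, sketched here in plain Python (the pattern only covers the PT…H…M subset of ISO 8601 used in this question, not days or seconds; `iso8601_duration_to_minutes` is a hypothetical helper name):

```python
import re

def iso8601_duration_to_minutes(s):
    # Match an optional hours part and an optional minutes part,
    # e.g. "PT5M", "PT2H5M"; returns None for strings that don't match.
    m = re.fullmatch(r"PT(?:(\d+)H)?(?:(\d+)M)?", s)
    if m is None:
        return None
    hours = int(m.group(1) or 0)
    minutes = int(m.group(2) or 0)
    return hours * 60 + minutes

print(iso8601_duration_to_minutes("PT2H5M"))  # 125
```

In Spark the same idea needs two `regexp_extract` calls (one per capture group) plus null handling for absent parts, which is why the question asks for something cleaner.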

https://www.digi.com/resources/documentation/digidocs/90001437-13/reference/r_iso_8601_duration_format.htm


Solution

  • Spark does not provide a built-in way to convert an ISO 8601 duration into an interval. Neither does timedelta in Python's datetime library.

    However, pd.Timedelta can parse an ISO 8601 duration into a time delta. To support the wider range of ISO 8601 durations it handles, we can wrap pd.Timedelta in a pandas_udf:

    
    from pyspark.sql import functions as F
    import pandas as pd
    
    df = spark.createDataFrame([("PT5M", ), ("PT50M", ), ("PT2H5M", ), ], ("duration", ))
    
    @F.pandas_udf("int")
    def parse_iso8601_duration(str_duration: pd.Series) -> pd.Series:
        # total_seconds() covers durations of a day or more; the .seconds
        # attribute only returns the seconds component and wraps at 24 hours
        return str_duration.apply(lambda duration: int(pd.Timedelta(duration).total_seconds() // 60))
    
    df.withColumn("duration_in_minutes", parse_iso8601_duration(F.col("duration"))).show()
    

    Output

    +--------+-------------------+
    |duration|duration_in_minutes|
    +--------+-------------------+
    |    PT5M|                  5|
    |   PT50M|                 50|
    |  PT2H5M|                125|
    +--------+-------------------+
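    The pd.Timedelta conversion can be sanity-checked locally in plain pandas before wrapping it in the UDF, without starting a Spark session:

    ```python
    import pandas as pd

    # Same inputs as the Spark example above
    s = pd.Series(["PT5M", "PT50M", "PT2H5M"])
    minutes = s.apply(lambda d: int(pd.Timedelta(d).total_seconds() // 60))
    print(minutes.tolist())  # [5, 50, 125]
    ```

    This also makes it easy to test edge cases (e.g. "P1DT2H") that the sample data doesn't cover.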