In ISO 8601, durations are written in the format PT5M (5 minutes) or PT2H5M (2 hours 5 minutes). I have a JSON file that contains values in this format, and I want to know whether Spark can extract the duration in minutes. I tried reading the field as DateType and applying the minutes function, but it returned null values.
Example JSON:
{"name": "Fennel Mushrooms","cookTime":"PT30M"}
Currently, I am reading it as a string and using the regexp_extract function. I would like to know if there is a more efficient way.
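For reference, a minimal sketch of the regexp_extract approach described above, assuming durations contain only hours and minutes (the file name recipes.json is hypothetical; the cookTime field follows the example JSON):

from pyspark.sql import functions as F

df = spark.read.json("recipes.json")  # hypothetical file holding the example record

# Pull the optional hour and minute components out of strings like "PT30M" or
# "PT2H5M"; a missing component extracts as "" and casts to null, so default it to 0.
df = (
    df.withColumn("hours", F.coalesce(F.regexp_extract("cookTime", r"(\d+)H", 1).cast("int"), F.lit(0)))
      .withColumn("minutes", F.coalesce(F.regexp_extract("cookTime", r"(\d+)M", 1).cast("int"), F.lit(0)))
      .withColumn("cookTimeMinutes", F.col("hours") * 60 + F.col("minutes"))
)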
Spark does not provide a built-in way to convert an ISO 8601 duration into an interval, and neither does timedelta in Python's datetime library. However, pd.Timedelta can parse ISO 8601 durations into time deltas. To support a wider range of ISO 8601 durations, we can wrap pd.Timedelta in a pandas_udf.
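As a quick sanity check in plain pandas (the second string adds a day component to show the broader coverage; the comments show what pandas returns):

import pandas as pd

pd.Timedelta("PT2H5M")   # Timedelta('0 days 02:05:00')
pd.Timedelta("P1DT30M")  # Timedelta('1 days 00:30:00')

The pandas_udf below applies the same parsing column-wise: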
from pyspark.sql import functions as F
import pandas as pd

df = spark.createDataFrame([("PT5M",), ("PT50M",), ("PT2H5M",)], ("duration",))

@F.pandas_udf("int")
def parse_iso8601_duration(str_duration: pd.Series) -> pd.Series:
    # total_seconds() covers durations of a day or more, where the
    # seconds attribute alone would drop the day component.
    return str_duration.apply(
        lambda duration: int(pd.Timedelta(duration).total_seconds() // 60)
    )

df.withColumn("duration_in_minutes", parse_iso8601_duration(F.col("duration"))).show()
+--------+-------------------+
|duration|duration_in_minutes|
+--------+-------------------+
| PT5M| 5|
| PT50M| 50|
| PT2H5M| 125|
+--------+-------------------+
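One detail worth noting: the UDF uses total_seconds() rather than the seconds attribute, because seconds only carries the sub-day part of the delta, so a duration like P1DT5M would come back as 5 instead of 1445 minutes. If fractional minutes matter, declare the return type as "double" and divide by 60 instead of floor-dividing.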