python, parsing, duration, python-polars, rust-chrono

Trouble with strptime() conversion of duration time string


I have some duration type data (lap times) as pl.Utf8 that fails to convert using strptime, whereas regular datetimes work as expected.

Minutes (before :) and Seconds (before .) are always padded to two digits, Milliseconds are always 3 digits.

Lap times are always < 2 min.

import polars as pl

df = pl.DataFrame({
    "lap_time": ["01:14.007", "00:53.040", "01:00.123"]
})

df = df.with_columns(
    [
        # pl.col('release_date').str.strptime(pl.Date, fmt="%B %d, %Y"), # works
        pl.col('lap_time').str.strptime(pl.Time, fmt="%M:%S.%3f").cast(pl.Duration), # fails
    ]
)

I took the format specifiers from chrono's strftime documentation at https://docs.rs/chrono/latest/chrono/format/strftime/index.html, which the polars docs for strptime say it uses.

The conversion for lap_time always fails, whether I use .%f, .%3f, or %.3f. Apparently, strptime can't create a pl.Duration directly, so I tried pl.Time instead, but it fails with this error:

ComputeError: strict conversion to dates failed, maybe set strict=False

but setting strict=False yields all null values for the whole Series.

Am I missing something, or is this some weird behavior on chrono's or python-polars' part?


Solution

  • General case

    If your durations may exceed 24 hours, you can extract the components (minutes, seconds, milliseconds) from the string with a regex pattern. For example:

    df = pl.DataFrame({
        "time": ["+01:14.007", "100:20.000", "-05:00.000"]
    })
    
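    df.with_columns(
        # Each string yields a list of three matches: ["min", "sec", "ms"].
        # An optional leading sign is captured together with the minutes.
        pl.col("time").str.extract_all(r"([+-]?\d+)")
    ).with_columns(
        pl.duration(
            # Newer polars releases name the list namespace `.list` instead of `.arr`.
            minutes=pl.col("time").arr.get(0),
            seconds=pl.col("time").arr.get(1),
            milliseconds=pl.col("time").arr.get(2)
        ).alias("time")
    )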
    
    ┌──────────────┐
    │ time         │
    │ ---          │
    │ duration[ns] │
    ╞══════════════╡
    │ 1m 14s 7ms   │
    │ 1h 40m 20s   │
    │ -5m          │
    └──────────────┘
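
One caveat with the regex above: the sign is captured with the minutes only, so an input such as -05:30.000 would combine -5 minutes with +30 seconds rather than negating the whole duration. The sketch below shows sign-aware parsing in plain Python (standard library only, not polars). The function name parse_duration and the stricter pattern are illustrative assumptions, not part of the original answer.

```python
import re
from datetime import timedelta

# Optional sign, any number of minute digits, two-digit seconds,
# three-digit milliseconds (the layout described in the question).
_DURATION_RE = re.compile(r"([+-]?)(\d+):(\d{2})\.(\d{3})")


def parse_duration(text: str) -> timedelta:
    """Parse '[+-]MM:SS.mmm' into a timedelta, applying the sign to the whole value."""
    match = _DURATION_RE.fullmatch(text)
    if match is None:
        raise ValueError(f"not a duration string: {text!r}")
    sign, minutes, seconds, millis = match.groups()
    magnitude = timedelta(
        minutes=int(minutes),
        seconds=int(seconds),
        milliseconds=int(millis),
    )
    # Negate the full magnitude so seconds and milliseconds follow the sign too.
    return -magnitude if sign == "-" else magnitude


if __name__ == "__main__":
    for raw in ["+01:14.007", "100:20.000", "-05:30.000"]:
        print(raw, "->", parse_duration(raw).total_seconds(), "seconds")
```

The same idea should carry over to polars by extracting the sign into its own column and multiplying the resulting duration by it, but that variant is untested here.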
    

  • About pl.Time

    pl.Time represents a time of day, so the format has to include hours as well. If you prepend 00 hours to each string, the conversion works:

    df = pl.DataFrame({"str_time": ["01:14.007", "01:18.880"]})
    
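    df.with_columns(
        # Prepend "00:" so the string carries an hours field, then parse with
        # %T (= %H:%M:%S) followed by %.3f (a dot plus 3 fractional digits).
        duration = (pl.lit("00:") + pl.col("str_time"))
            .str.strptime(pl.Time, fmt="%T%.3f")
            .cast(pl.Duration)
    )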
    
    ┌───────────┬──────────────┐
    │ str_time  ┆ duration     │
    │ ---       ┆ ---          │
    │ str       ┆ duration[μs] │
    ╞═══════════╪══════════════╡
    │ 01:14.007 ┆ 1m 14s 7ms   │
    │ 01:18.880 ┆ 1m 18s 880ms │
    └───────────┴──────────────┘
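
To cross-check the polars result, or to see where the prepend-hours trick stops working, the same idea can be sketched with Python's standard library. This is an illustration under stated assumptions rather than part of the original answer: the helper name lap_time_to_timedelta is hypothetical, and Python's %f directive stands in for chrono's %.3f.

```python
from datetime import datetime, timedelta


def lap_time_to_timedelta(lap_time: str) -> timedelta:
    """Parse 'MM:SS.mmm' by prepending zero hours and reading it as a time of day."""
    parsed = datetime.strptime("00:" + lap_time, "%H:%M:%S.%f")
    # Keep only the time-of-day fields; the date part is a strptime default.
    return timedelta(
        hours=parsed.hour,
        minutes=parsed.minute,
        seconds=parsed.second,
        microseconds=parsed.microsecond,
    )


if __name__ == "__main__":
    for raw in ["01:14.007", "01:18.880"]:
        print(raw, "->", lap_time_to_timedelta(raw))
    try:
        lap_time_to_timedelta("100:20.000")
    except ValueError as exc:
        # A minutes field above 59 is not a valid time of day.
        print("rejected:", exc)
```

Because the string is read as a time of day, the minutes field must stay below 60. That is fine for lap times under 2 minutes, but inputs like 100:20.000 need the regex approach from the general case.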