Search code examples
python-polars

Parse string column containing date with only month and year


For example:

import polars as pl

dates = pl.DataFrame(pl.Series("date", ["2024,2", "2024,12"]))
dates = dates.with_columns(pl.col("date").str.to_date("%Y,%m"))

This does not work because of the missing day. Python's strptime adds a day of 1:

from time import strptime

strptime("2024,2", "%Y,%m")

Is there any way to do something similar with polars, apart from just attaching to the strings a day?

import polars as pl

dates = pl.DataFrame(pl.Series("date", ["2024,2", "2024,12"]))
dates = dates.with_columns((pl.col("date") + ",1").str.to_date(r"%Y,%m,%d"))

Which is an okay solution, but it does unnecessary work. So the best solution I have come up with in terms of not doing repetitive work is:

import polars as pl

dates = pl.DataFrame(pl.Series("date", ["2024,2", "2024,12"]))
dates = (
    dates.with_columns(
        pl.col("date").str.split(",").list.to_struct(fields=["year", "month"])
    )
    .unnest("date")
    .with_columns(date=pl.date(pl.col("year"), pl.col("month"), 1))
)

Solution

  • Going from a list to a struct requires copying the data and since strings are stored as stringview, it might be cheaper to just concat the string as in your first solution.

    At the very least you can skip going through the struct and unnesting which is also unnecessary by using .list.get(). You can couple this with the walrus operator to keep it DRY.

    dates = dates.with_columns(
        date=pl.date(
            (date_split := pl.col("date").str.split(",")).list.get(0),
            date_split.list.get(1),
            1,
        )
    )
    

    Alternatively, you could skip the list by using .str.split_exact which returns a struct upfront. If you do that, you can skip unnest with .struct[0] and .struct[1] as in

    dates.with_columns(
        date=pl.date(
            (datestruct := pl.col("date").str.split_exact(",", 1)).struct[0],
            datestruct.struct[1],
            1,
        )
    )
    

    Performance

    Setup

    df=pl.DataFrame({'dates':pl.date_range(pl.date(1970,1,1), pl.date(2024,7,1), '1mo', eager=True)})
    df=df.with_columns(pl.col('dates').dt.strftime("%Y,%m"))
    

    String concat

    %%timeit
    df.with_columns((pl.col('dates')+",1").str.to_date(r"%Y,%m,%d"))
    731 µs ± 6.81 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
    

    through list

    %%timeit
    dates.with_columns(
        date=pl.date(
            (date_split := pl.col("date").str.split(",")).list.get(0),
            date_split.list.get(1),
            1,
        )
    )
    1.35 ms ± 6.68 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
    

    through struct

    %%timeit
    dates.with_columns(
        date=pl.date(
            (datestruct := pl.col("date").str.split_exact(",", 1)).struct[0],
            datestruct.struct[1],
            1,
        )
    )
    1.29 ms ± 30.1 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
    

    Conclusion

    Using string concat is much faster than splitting the string and using a date constructor.