Search code examples
pythondatedatetimetimezonepython-polars

How to extract date from datetime column in polars


I am trying to move from pandas to polars but I am running into the following issue.

df = pl.DataFrame(
    {
        "integer": [1, 2, 3], 
        "date": [
            "2010-01-31T23:00:00+00:00",
            "2010-02-01T00:00:00+00:00",
            "2010-02-01T01:00:00+00:00"
        ]
    }
)

df = df.with_columns(
    pl.col("date").str.to_datetime().dt.convert_time_zone("Europe/Amsterdam")
)

Yields the following dataframe:

shape: (3, 2)
┌─────────┬────────────────────────────────┐
│ integer ┆ date                           │
│ ---     ┆ ---                            │
│ i64     ┆ datetime[μs, Europe/Amsterdam] │
╞═════════╪════════════════════════════════╡
│ 1       ┆ 2010-02-01 00:00:00 CET        │
│ 2       ┆ 2010-02-01 01:00:00 CET        │
│ 3       ┆ 2010-02-01 02:00:00 CET        │
└─────────┴────────────────────────────────┘

As you can see, I transformed the datetime string from UTC to CET succesfully.

However, if I try to cast to pl.Date (as suggested here), it seems to extract the date from the UTC string even though it has been transformed, e.g.:

df = df.with_columns(
    pl.col("date").cast(pl.Date).alias("valueDay")
)
shape: (3, 3)
┌─────────┬────────────────────────────────┬────────────┐
│ integer ┆ date                           ┆ valueDay   │
│ ---     ┆ ---                            ┆ ---        │
│ i64     ┆ datetime[μs, Europe/Amsterdam] ┆ date       │
╞═════════╪════════════════════════════════╪════════════╡
│ 1       ┆ 2010-02-01 00:00:00 CET        ┆ 2010-01-31 │ # <- NOT OK
│ 2       ┆ 2010-02-01 01:00:00 CET        ┆ 2010-02-01 │
│ 3       ┆ 2010-02-01 02:00:00 CET        ┆ 2010-02-01 │
└─────────┴────────────────────────────────┴────────────┘

The valueDay should be 2010-02-01 for all 3 values.

Can anyone help me fix this? A pandas dt.date like way to approach this would be nice.

By the way, what is the best way to optimize this code? Do I constantly have to assign everything to df or is there a way to chain all of this?


Solution

  • Update: .dt.date() has since been added to Polars.

    import polars as pl
    
    df = pl.DataFrame(
        {
            "integer": [1, 2, 3],
            "date": [
                "2010-01-31T23:00:00+00:00",
                "2010-02-01T00:00:00+00:00",
                "2010-02-01T01:00:00+00:00",
            ],
        }
    )
    
    df = df.with_columns(
        pl.col("date").str.to_datetime().dt.convert_time_zone("Europe/Amsterdam")
    )
    
    df = df.with_columns(
        pl.col("date").dt.date().alias("valueDay"),
        pl.col("date").dt.day().alias("day"),
        pl.col("date").dt.month().alias("month"),
        pl.col("date").dt.year().alias("year"),
    )
    
    shape: (3, 6)
    ┌─────────┬────────────────────────────────┬────────────┬─────┬───────┬──────┐
    │ integer ┆ date                           ┆ valueDay   ┆ day ┆ month ┆ year │
    │ ---     ┆ ---                            ┆ ---        ┆ --- ┆ ---   ┆ ---  │
    │ i64     ┆ datetime[μs, Europe/Amsterdam] ┆ date       ┆ i8  ┆ i8    ┆ i32  │
    ╞═════════╪════════════════════════════════╪════════════╪═════╪═══════╪══════╡
    │ 1       ┆ 2010-02-01 00:00:00 CET        ┆ 2010-02-01 ┆ 1   ┆ 2     ┆ 2010 │
    │ 2       ┆ 2010-02-01 01:00:00 CET        ┆ 2010-02-01 ┆ 1   ┆ 2     ┆ 2010 │
    │ 3       ┆ 2010-02-01 02:00:00 CET        ┆ 2010-02-01 ┆ 1   ┆ 2     ┆ 2010 │
    └─────────┴────────────────────────────────┴────────────┴─────┴───────┴──────┘