Search code examples
pythondatetimetimezonepython-polars

`convert_time_zone` function to retrieve the values based on the timezone specified for each row in Polars


I'm attempting to determine the time based on the timezone specified in each row using Polars. Consider the following code snippet:

import polars as pl
from datetime import datetime
from polars import col as c

df = pl.DataFrame({
    "time": [datetime(2023, 4, 3, 2), datetime(2023, 4, 4, 3), datetime(2023, 4, 5, 4)],
    "tzone": ["Asia/Tokyo", "America/Chicago", "Europe/Paris"]
}).with_columns(c.time.dt.replace_time_zone("UTC"))

df.with_columns(
    tokyo=c.time.dt.convert_time_zone("Asia/Tokyo").dt.hour(),
    chicago=c.time.dt.convert_time_zone("America/Chicago").dt.hour(),
    paris=c.time.dt.convert_time_zone("Europe/Paris").dt.hour()
)

In this example, I've computed the time separately for each timezone to achieve the desired outcome, which is [11, 22, 6], corresponding to the hour of the time column according to the tzone timezone. Even then it is difficult to collect the information from the correct column.

Unfortunately, the following simple attempt to dynamically pass the timezone from the tzone column directly into the convert_time_zone function does not work:

df.with_columns(c.time.dt.convert_time_zone(c.tzone).dt.hour())
# TypeError: argument 'time_zone': 'Expr' object cannot be converted to 'PyString'

What would be the most elegant approach to accomplish this task?


Solution

  • The only way to do this which fully works with lazy execution is to use the polars-xdt plugin:

    df = pl.DataFrame(
        {
            "time": [
                datetime(2023, 4, 3, 2),
                datetime(2023, 4, 4, 3),
                datetime(2023, 4, 5, 4),
            ],
            "tzone": ["Asia/Tokyo", "America/Chicago", "Europe/Paris"],
        }
    ).with_columns(pl.col("time").dt.replace_time_zone("UTC"))
    
    df.with_columns(
        result=xdt.to_local_datetime("time", pl.col("tzone")).dt.hour(),
    )
    

    Result:

    Out[6]:
    shape: (3, 3)
    ┌─────────────────────────┬─────────────────┬────────┐
    │ time                    ┆ tzone           ┆ result │
    │ ---                     ┆ ---             ┆ ---    │
    │ datetime[μs, UTC]       ┆ str             ┆ i8     │
    ╞═════════════════════════╪═════════════════╪════════╡
    │ 2023-04-03 02:00:00 UTC ┆ Asia/Tokyo      ┆ 11     │
    │ 2023-04-04 03:00:00 UTC ┆ America/Chicago ┆ 22     │
    │ 2023-04-05 04:00:00 UTC ┆ Europe/Paris    ┆ 6      │
    └─────────────────────────┴─────────────────┴────────┘
    

    https://github.com/pola-rs/polars-xdt

    If you don't need lazy execution, then as other answers have suggested, you can iterate over the unique values of your 'time_zone' column