Search code examples
pythonpython-polars

Polars - use is_in() with durations


I have a dataframe with a column containing a set of durations, like 5m, 15m, etc...

df = pl.DataFrame({
    "duration_m": [5, 15, 30]
})
df = df.with_columns(
    duration = pl.duration( minutes = pl.col("duration_m"))
)
df

shape: (3, 2)
┌────────────┬──────────────┐
│ duration_m ┆ duration     │
│ ---        ┆ ---          │
│ i64        ┆ duration[ns] │
╞════════════╪══════════════╡
│ 5          ┆ 5m           │
│ 15         ┆ 15m          │
│ 30         ┆ 30m          │
└────────────┴──────────────┘

I want to filter this dataframe to return only rows that are 5m or 15m using is_in(). However, when I do this:

my_durations = [pl.duration(minutes=5), pl.duration(minutes=15)]
df.filter(pl.col("duration").is_in(my_durations))

I get:

InvalidOperationError: 'is_in' cannot check for Object("object", Some(object-registry)) values in Duration(Microseconds) data

This has no error (but is obviously not what I want to do):

df.filter(pl.col("duration").is_in(pl.duration(minutes=5)))

Thanks


Solution

  • I'm not sure why, even though pl.duration is an Expr and .is_in accepts expressions that it errors out. My best guess is that it sees a python list first so it assumes away getting an expression. From there it doesn't get a python type that it knows what to do with and just sees an object.

    Causes aside, you have two ways to get around this.

    The first: If you wrap your list in a pl.concat_list then you still have an expression instead of a python list but that won't be enough because it doesn't broadcast that to the whole df so you'll get an error about size mismatch which you can also fix by wrapping that in pl.repeat like this:

    df.filter(pl.col('duration')
              .is_in(pl.repeat(pl.concat_list(my_durations), pl.count()))
              )
    shape: (2, 2)
    ┌────────────┬──────────────┐
    │ duration_m ┆ duration     │
    │ ---        ┆ ---          │
    │ i64        ┆ duration[ns] │
    ╞════════════╪══════════════╡
    │ 5          ┆ 5m           │
    │ 15         ┆ 15m          │
    └────────────┴──────────────┘
    

    That's awfully clunky, better to just use the second method.

    The second: Just use timedelta instead of duration when you're not setting it with the values in a df like this:

    from datetime import timedelta
    my_durations = [timedelta(minutes=5), timedelta(minutes=15)]
    df.filter(pl.col("duration").is_in(my_durations))
    

    As an aside...

    if you start with

    my_durations = [pl.duration(minutes=5), pl.duration(minutes=15)]
    

    then you can get the same timedelta list by doing:

    [pl.select(x).item() for x in my_durations]