I have a dataframe with a column containing a set of durations, like 5m, 15m, etc...
df = pl.DataFrame({
"duration_m": [5, 15, 30]
})
df = df.with_columns(
duration = pl.duration( minutes = pl.col("duration_m"))
)
df
shape: (3, 2)
┌────────────┬──────────────┐
│ duration_m ┆ duration │
│ --- ┆ --- │
│ i64 ┆ duration[ns] │
╞════════════╪══════════════╡
│ 5 ┆ 5m │
│ 15 ┆ 15m │
│ 30 ┆ 30m │
└────────────┴──────────────┘
I want to filter this dataframe to return only rows that are 5m or 15m using is_in(). However, when I do this:
my_durations = [pl.duration(minutes=5), pl.duration(minutes=15)]
df.filter(pl.col("duration").is_in(my_durations))
I get:
InvalidOperationError: 'is_in' cannot check for Object("object", Some(object-registry)) values in Duration(Microseconds) data
This has no error (but is obviously not what I want to do):
df.filter(pl.col("duration").is_in(pl.duration(minutes=5)))
Thanks
I'm not sure why, even though pl.duration
is an Expr and .is_in
accepts expressions that it errors out. My best guess is that it sees a python list first so it assumes away getting an expression. From there it doesn't get a python type that it knows what to do with and just sees an object.
Causes aside, you have two ways to get around this.
The first:
If you wrap your list in a pl.concat_list
then you still have an expression instead of a python list but that won't be enough because it doesn't broadcast that to the whole df so you'll get an error about size mismatch which you can also fix by wrapping that in pl.repeat
like this:
df.filter(pl.col('duration')
.is_in(pl.repeat(pl.concat_list(my_durations), pl.count()))
)
shape: (2, 2)
┌────────────┬──────────────┐
│ duration_m ┆ duration │
│ --- ┆ --- │
│ i64 ┆ duration[ns] │
╞════════════╪══════════════╡
│ 5 ┆ 5m │
│ 15 ┆ 15m │
└────────────┴──────────────┘
That's awfully clunky, better to just use the second method.
The second:
Just use timedelta
instead of duration
when you're not setting it with the values in a df like this:
from datetime import timedelta
my_durations = [timedelta(minutes=5), timedelta(minutes=15)]
df.filter(pl.col("duration").is_in(my_durations))
As an aside...
if you start with
my_durations = [pl.duration(minutes=5), pl.duration(minutes=15)]
then you can get the same timedelta
list by doing:
[pl.select(x).item() for x in my_durations]