Say for example I have data like this:
import polars as pl
df = pl.DataFrame(
{
"subject": ["subject1", "subject2"],
"emails": [
["samATxyz.com", "janeATxyz.com", "jimATcustomer.org"],
["samATxyz.com", "zaneATxyz.com", "basATcustomer.org", "jimATcustomer.org"],
],
}
)
df
shape: (2, 2)
┌──────────┬─────────────────────────────────────────────────────────────────────────────┐
│ subject ┆ emails │
│ --- ┆ --- │
│ str ┆ list[str] │
╞══════════╪═════════════════════════════════════════════════════════════════════════════╡
│ subject1 ┆ ["samATxyz.com", "janeATxyz.com", "jimATcustomer.org"] │
│ subject2 ┆ ["samATxyz.com", "zaneATxyz.com", "basATcustomer.org", "jimATcustomer.org"] │
└──────────┴─────────────────────────────────────────────────────────────────────────────┘
I want to filter the data so that the emails column only contains emails that end in "ATxyz.com":
shape: (2, 2)
┌──────────┬───────────────────────────────────┐
│ subject ┆ emails │
│ --- ┆ --- │
│ str ┆ list[str] │
╞══════════╪═══════════════════════════════════╡
│ subject1 ┆ ["samATxyz.com", "janeATxyz.com"] │
│ subject2 ┆ ["samATxyz.com", "zaneATxyz.com"] │
└──────────┴───────────────────────────────────┘
How can I do this using polars?
I had a few ideas, but I cannot figure out the right syntax, or it seems more complex/verbose than I would expect:
- .list.eval(pl.element() ..., but I cannot figure out how to filter items in the list with this syntax.
- .explode, but this seems verbose and more complex than needed.

This is as close as I have got:
import polars as pl
df = pl.DataFrame(
{
"subject": ["subject1", "subject2"],
"emails": [
["samATxyz.com", "janeATxyz.com", "jimATcustomer.org"],
["samATxyz.com", "zaneATxyz.com", "basATcustomer.org", "jimATcustomer.org"],
],
}
)
df.with_columns(
pl.col("emails").list.eval(pl.element().str.contains("ATxyz")),
)
shape: (2, 2)
┌──────────┬────────────────────────────┐
│ subject ┆ emails │
│ --- ┆ --- │
│ str ┆ list[bool] │
╞══════════╪════════════════════════════╡
│ subject1 ┆ [true, true, false] │
│ subject2 ┆ [true, true, false, false] │
└──────────┴────────────────────────────┘
You were on the right track with pl.Expr.list.eval. It can be combined with pl.Expr.filter to achieve the desired result as follows.
df.with_columns(
pl.col("emails").list.eval(
pl.element().filter(pl.element().str.ends_with("ATxyz.com"))
)
)
shape: (2, 2)
┌──────────┬───────────────────────────────────┐
│ subject ┆ emails │
│ --- ┆ --- │
│ str ┆ list[str] │
╞══════════╪═══════════════════════════════════╡
│ subject1 ┆ ["samATxyz.com", "janeATxyz.com"] │
│ subject2 ┆ ["samATxyz.com", "zaneATxyz.com"] │
└──────────┴───────────────────────────────────┘