
How can I filter a list within a Polars column?


Say for example I have data like this:

import polars as pl

df = pl.DataFrame(
    {
        "subject": ["subject1", "subject2"],
        "emails": [
            ["samATxyz.com", "janeATxyz.com", "jimATcustomer.org"],
            ["samATxyz.com", "zaneATxyz.com", "basATcustomer.org", "jimATcustomer.org"],
        ],
    }
)

df
shape: (2, 2)
┌──────────┬─────────────────────────────────────────────────────────────────────────────┐
│ subject  ┆ emails                                                                      │
│ ---      ┆ ---                                                                         │
│ str      ┆ list[str]                                                                   │
╞══════════╪═════════════════════════════════════════════════════════════════════════════╡
│ subject1 ┆ ["samATxyz.com", "janeATxyz.com", "jimATcustomer.org"]                      │
│ subject2 ┆ ["samATxyz.com", "zaneATxyz.com", "basATcustomer.org", "jimATcustomer.org"] │
└──────────┴─────────────────────────────────────────────────────────────────────────────┘

I want to filter the data so that the emails column only contains emails that end in "ATxyz.com".

shape: (2, 2)
┌──────────┬───────────────────────────────────┐
│ subject  ┆ emails                            │
│ ---      ┆ ---                               │
│ str      ┆ list[str]                         │
╞══════════╪═══════════════════════════════════╡
│ subject1 ┆ ["samATxyz.com", "janeATxyz.com"] │
│ subject2 ┆ ["samATxyz.com", "zaneATxyz.com"] │
└──────────┴───────────────────────────────────┘

How can I do this using Polars?

I had a few ideas, but I cannot figure out the right syntax, or it seems more complex/verbose than I would expect:

  • Maybe I could somehow filter the data using .list.eval(pl.element() ...), but I cannot figure out how to filter items in the list with this syntax.
  • I could reshape the data using .explode, but this seems verbose and more complex than needed.

This is as close as I have got:

import polars as pl

df = pl.DataFrame(
    {
        "subject": ["subject1", "subject2"],
        "emails": [
            ["samATxyz.com", "janeATxyz.com", "jimATcustomer.org"],
            ["samATxyz.com", "zaneATxyz.com", "basATcustomer.org", "jimATcustomer.org"],
        ],
    }
)

df.with_columns(
    pl.col("emails").list.eval(pl.element().str.contains("ATxyz")),
)
shape: (2, 2)
┌──────────┬────────────────────────────┐
│ subject  ┆ emails                     │
│ ---      ┆ ---                        │
│ str      ┆ list[bool]                 │
╞══════════╪════════════════════════════╡
│ subject1 ┆ [true, true, false]        │
│ subject2 ┆ [true, true, false, false] │
└──────────┴────────────────────────────┘

Solution

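  • You were on the right track with pl.Expr.list.eval. Your attempt returns the boolean mask itself; passing that mask to pl.Expr.filter keeps only the elements for which it is true. Using str.ends_with rather than str.contains also matches the stated requirement (emails that end in "ATxyz.com") more precisely.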

    df.with_columns(
        pl.col("emails").list.eval(
            pl.element().filter(pl.element().str.ends_with("ATxyz.com"))
        )
    )
    
    shape: (2, 2)
    ┌──────────┬───────────────────────────────────┐
    │ subject  ┆ emails                            │
    │ ---      ┆ ---                               │
    │ str      ┆ list[str]                         │
    ╞══════════╪═══════════════════════════════════╡
    │ subject1 ┆ ["samATxyz.com", "janeATxyz.com"] │
    │ subject2 ┆ ["samATxyz.com", "zaneATxyz.com"] │
    └──────────┴───────────────────────────────────┘