Search code examples
pythonpython-polars

How to filter DataFrame by any string column matching any regex pattern in a list?


String columns can be selected with:

df.select(pl.col(pl.String))

and a Dataframe's rows can be filtered with a regex pattern for a single column, with something like:

df.filter(pl.col("feature").str.contains("dangerous"))

How can a DataFrame be filtered with a list of regex patterns that could appear in any string column? I.e., if any string in a row matches any regex pattern, then keep that entire row, discard the rest.

EDIT 1

Here's a generated df and patterns to test functionality and performance.

import random
from faker import Faker
import polars as pl

random.seed(42)
Faker.seed(42)

faker = Faker()

df_len = 10000

df = pl.DataFrame(
    [
        pl.Series("a", [random.randint(0, 511) for _ in range(df_len)]).cast(pl.Binary),
        pl.Series("b", [random.randint(0, 1) for _ in range(df_len)]).cast(pl.Boolean),
        pl.Series("c", faker.sentences(df_len), pl.String),
        pl.Series("d", [random.randint(0, 255) for _ in range(df_len)], pl.UInt8),
        pl.Series("e", faker.words(df_len), pl.String),
        pl.Series(
            "f",
            [random.randint(0, 255) * random.TWOPI for _ in range(df_len)],
            pl.Float32,
        ),
        pl.Series("g", faker.words(df_len), pl.String),
    ]
)

patterns = [r"(?i)dangerous", r"always", r"(?i)prevent"]

print(df) yields:

shape: (10_000, 7)
┌────────┬───────┬─────────────────────────────────┬─────┬───────────┬─────────────┬──────────┐
│ a      ┆ b     ┆ c                               ┆ d   ┆ e         ┆ f           ┆ g        │
│ ---    ┆ ---   ┆ ---                             ┆ --- ┆ ---       ┆ ---         ┆ ---      │
│ binary ┆ bool  ┆ str                             ┆ u8  ┆ str       ┆ f32         ┆ str      │
╞════════╪═══════╪═════════════════════════════════╪═════╪═══════════╪═════════════╪══════════╡
│ b"114" ┆ false ┆ Agent every development say.    ┆ 164 ┆ let       ┆ 980.17688   ┆ yard     │
│ b"25"  ┆ true  ┆ Beautiful instead ahead despit… ┆ 210 ┆ reach     ┆ 458.672516  ┆ son      │
│ b"281" ┆ false ┆ Information last everything th… ┆ 230 ┆ arm       ┆ 50.265484   ┆ standard │
│ b"250" ┆ false ┆ Choice whatever from behavior … ┆ 29  ┆ operation ┆ 929.911438  ┆ final    │
│ b"228" ┆ false ┆ Page southern role movie win h… ┆ 242 ┆ coach     ┆ 1149.822876 ┆ none     │
│ …      ┆ …     ┆ …                               ┆ …   ┆ …         ┆ …           ┆ …        │
│ b"30"  ┆ true  ┆ Huge course partner.            ┆ 249 ┆ media     ┆ 1118.406982 ┆ movement │
│ b"33"  ┆ true  ┆ Building sign recently avoid u… ┆ 132 ┆ practice  ┆ 282.743347  ┆ big      │
│ b"346" ┆ false ┆ Paper will board.               ┆ 72  ┆ similar   ┆ 376.991119  ┆ just     │
│ b"431" ┆ true  ┆ Technology money worker spring… ┆ 140 ┆ sign      ┆ 94.24778    ┆ audience │
│ b"267" ┆ false ┆ A third traditional ago.        ┆ 40  ┆ available ┆ 615.752136  ┆ always   │
└────────┴───────┴─────────────────────────────────┴─────┴───────────┴─────────────┴──────────┘

EDIT 2

Using @jqurious's answer (the fastest so far), the correct output of df.filter(pl.any_horizontal(pl.col(pl.String).str.contains(regex))) is:

shape: (146, 7)
┌────────┬───────┬─────────────────────────────────┬─────┬───────────┬─────────────┬──────────┐
│ a      ┆ b     ┆ c                               ┆ d   ┆ e         ┆ f           ┆ g        │
│ ---    ┆ ---   ┆ ---                             ┆ --- ┆ ---       ┆ ---         ┆ ---      │
│ binary ┆ bool  ┆ str                             ┆ u8  ┆ str       ┆ f32         ┆ str      │
╞════════╪═══════╪═════════════════════════════════╪═════╪═══════════╪═════════════╪══════════╡
│ b"57"  ┆ true  ┆ During prevent accept seem sho… ┆ 137 ┆ various   ┆ 471.238892  ┆ customer │
│ b"269" ┆ true  ┆ Ball always it focus economy b… ┆ 179 ┆ key       ┆ 471.238892  ┆ guy      │
│ b"250" ┆ false ┆ Admit attack energy always.     ┆ 175 ┆ purpose   ┆ 1281.769775 ┆ wonder   │
│ b"82"  ┆ false ┆ Beyond prevent entire staff.    ┆ 242 ┆ hair      ┆ 904.778687  ┆ around   │
│ b"186" ┆ false ┆ Suffer accept letter visit alw… ┆ 134 ┆ magazine  ┆ 12.566371   ┆ dream    │
│ …      ┆ …     ┆ …                               ┆ …   ┆ …         ┆ …           ┆ …        │
│ b"464" ┆ true  ┆ Kid prevent avoid quite brothe… ┆ 153 ┆ visit     ┆ 879.645935  ┆ anything │
│ b"426" ┆ true  ┆ Your sure piece simple always … ┆ 247 ┆ recently  ┆ 1055.575073 ┆ laugh    │
│ b"403" ┆ false ┆ Difference all machine let cha… ┆ 178 ┆ former    ┆ 1061.858276 ┆ always   │
│ b"184" ┆ true  ┆ Morning carry event tell preve… ┆ 3   ┆ entire    ┆ 1432.566284 ┆ hit      │
│ b"267" ┆ false ┆ A third traditional ago.        ┆ 40  ┆ available ┆ 615.752136  ┆ always   │
└────────┴───────┴─────────────────────────────────┴─────┴───────────┴─────────────┴──────────┘

Solution

  • You could turn the list into a single regex.

    regex = "|".join(
       f"(?:{pattern})" for pattern in 
       sorted(patterns, key=len, reverse=True)
    )
    

    And use .any_horizontal() to get a single boolean for each row to .filter() with.

    df.filter(pl.any_horizontal(pl.col(pl.String).str.contains(regex)))
    
    shape: (146, 7)
    ┌────────┬───────┬─────────────────────────────────┬─────┬───────────┬─────────────┬──────────┐
    │ a      ┆ b     ┆ c                               ┆ d   ┆ e         ┆ f           ┆ g        │
    │ ---    ┆ ---   ┆ ---                             ┆ --- ┆ ---       ┆ ---         ┆ ---      │
    │ binary ┆ bool  ┆ str                             ┆ u8  ┆ str       ┆ f32         ┆ str      │
    ╞════════╪═══════╪═════════════════════════════════╪═════╪═══════════╪═════════════╪══════════╡
    │ b"57"  ┆ true  ┆ During prevent accept seem sho… ┆ 137 ┆ various   ┆ 471.238892  ┆ customer │
    │ b"269" ┆ true  ┆ Ball always it focus economy b… ┆ 179 ┆ key       ┆ 471.238892  ┆ guy      │
    │ b"250" ┆ false ┆ Admit attack energy always.     ┆ 175 ┆ purpose   ┆ 1281.769775 ┆ wonder   │
    │ b"82"  ┆ false ┆ Beyond prevent entire staff.    ┆ 242 ┆ hair      ┆ 904.778687  ┆ around   │
    │ b"186" ┆ false ┆ Suffer accept letter visit alw… ┆ 134 ┆ magazine  ┆ 12.566371   ┆ dream    │
    │ …      ┆ …     ┆ …                               ┆ …   ┆ …         ┆ …           ┆ …        │
    │ b"464" ┆ true  ┆ Kid prevent avoid quite brothe… ┆ 153 ┆ visit     ┆ 879.645935  ┆ anything │
    │ b"426" ┆ true  ┆ Your sure piece simple always … ┆ 247 ┆ recently  ┆ 1055.575073 ┆ laugh    │
    │ b"403" ┆ false ┆ Difference all machine let cha… ┆ 178 ┆ former    ┆ 1061.858276 ┆ always   │
    │ b"184" ┆ true  ┆ Morning carry event tell preve… ┆ 3   ┆ entire    ┆ 1432.566284 ┆ hit      │
    │ b"267" ┆ false ┆ A third traditional ago.        ┆ 40  ┆ available ┆ 615.752136  ┆ always   │
    └────────┴───────┴─────────────────────────────────┴─────┴───────────┴─────────────┴──────────┘