String columns can be selected with:
df.select(pl.col(pl.String))
and a Dataframe's rows can be filtered with a regex pattern for a single column, with something like:
df.filter(pl.col("feature").str.contains("dangerous"))
How can a DataFrame be filtered with a list of regex patterns that could appear in any string column? I.e., if any string in a row matches any regex pattern, then keep that entire row, discard the rest.
EDIT 1
Here's a generated df
and patterns
to test functionality and performance.
import random
from faker import Faker
import polars as pl
random.seed(42)
Faker.seed(42)
faker = Faker()
df_len = 10000
df = pl.DataFrame(
[
pl.Series("a", [random.randint(0, 511) for _ in range(df_len)]).cast(pl.Binary),
pl.Series("b", [random.randint(0, 1) for _ in range(df_len)]).cast(pl.Boolean),
pl.Series("c", faker.sentences(df_len), pl.String),
pl.Series("d", [random.randint(0, 255) for _ in range(df_len)], pl.UInt8),
pl.Series("e", faker.words(df_len), pl.String),
pl.Series(
"f",
[random.randint(0, 255) * random.TWOPI for _ in range(df_len)],
pl.Float32,
),
pl.Series("g", faker.words(df_len), pl.String),
]
)
patterns = [r"(?i)dangerous", r"always", r"(?i)prevent"]
print(df)
yields:
shape: (10_000, 7)
┌────────┬───────┬─────────────────────────────────┬─────┬───────────┬─────────────┬──────────┐
│ a ┆ b ┆ c ┆ d ┆ e ┆ f ┆ g │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ binary ┆ bool ┆ str ┆ u8 ┆ str ┆ f32 ┆ str │
╞════════╪═══════╪═════════════════════════════════╪═════╪═══════════╪═════════════╪══════════╡
│ b"114" ┆ false ┆ Agent every development say. ┆ 164 ┆ let ┆ 980.17688 ┆ yard │
│ b"25" ┆ true ┆ Beautiful instead ahead despit… ┆ 210 ┆ reach ┆ 458.672516 ┆ son │
│ b"281" ┆ false ┆ Information last everything th… ┆ 230 ┆ arm ┆ 50.265484 ┆ standard │
│ b"250" ┆ false ┆ Choice whatever from behavior … ┆ 29 ┆ operation ┆ 929.911438 ┆ final │
│ b"228" ┆ false ┆ Page southern role movie win h… ┆ 242 ┆ coach ┆ 1149.822876 ┆ none │
│ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … │
│ b"30" ┆ true ┆ Huge course partner. ┆ 249 ┆ media ┆ 1118.406982 ┆ movement │
│ b"33" ┆ true ┆ Building sign recently avoid u… ┆ 132 ┆ practice ┆ 282.743347 ┆ big │
│ b"346" ┆ false ┆ Paper will board. ┆ 72 ┆ similar ┆ 376.991119 ┆ just │
│ b"431" ┆ true ┆ Technology money worker spring… ┆ 140 ┆ sign ┆ 94.24778 ┆ audience │
│ b"267" ┆ false ┆ A third traditional ago. ┆ 40 ┆ available ┆ 615.752136 ┆ always │
└────────┴───────┴─────────────────────────────────┴─────┴───────────┴─────────────┴──────────┘
EDIT 2
Using @jqurious's answer (the fastest so far), the correct output of df.filter(pl.any_horizontal(pl.col(pl.String).str.contains(regex)))
is:
shape: (146, 7)
┌────────┬───────┬─────────────────────────────────┬─────┬───────────┬─────────────┬──────────┐
│ a ┆ b ┆ c ┆ d ┆ e ┆ f ┆ g │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ binary ┆ bool ┆ str ┆ u8 ┆ str ┆ f32 ┆ str │
╞════════╪═══════╪═════════════════════════════════╪═════╪═══════════╪═════════════╪══════════╡
│ b"57" ┆ true ┆ During prevent accept seem sho… ┆ 137 ┆ various ┆ 471.238892 ┆ customer │
│ b"269" ┆ true ┆ Ball always it focus economy b… ┆ 179 ┆ key ┆ 471.238892 ┆ guy │
│ b"250" ┆ false ┆ Admit attack energy always. ┆ 175 ┆ purpose ┆ 1281.769775 ┆ wonder │
│ b"82" ┆ false ┆ Beyond prevent entire staff. ┆ 242 ┆ hair ┆ 904.778687 ┆ around │
│ b"186" ┆ false ┆ Suffer accept letter visit alw… ┆ 134 ┆ magazine ┆ 12.566371 ┆ dream │
│ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … │
│ b"464" ┆ true ┆ Kid prevent avoid quite brothe… ┆ 153 ┆ visit ┆ 879.645935 ┆ anything │
│ b"426" ┆ true ┆ Your sure piece simple always … ┆ 247 ┆ recently ┆ 1055.575073 ┆ laugh │
│ b"403" ┆ false ┆ Difference all machine let cha… ┆ 178 ┆ former ┆ 1061.858276 ┆ always │
│ b"184" ┆ true ┆ Morning carry event tell preve… ┆ 3 ┆ entire ┆ 1432.566284 ┆ hit │
│ b"267" ┆ false ┆ A third traditional ago. ┆ 40 ┆ available ┆ 615.752136 ┆ always │
└────────┴───────┴─────────────────────────────────┴─────┴───────────┴─────────────┴──────────┘
You could turn the list into a single regex.
regex = "|".join(
f"(?:{pattern})" for pattern in
sorted(patterns, key=len, reverse=True)
)
And use .any_horizontal()
to get a single boolean for each row to .filter()
with.
df.filter(pl.any_horizontal(pl.col(pl.String).str.contains(regex)))
shape: (146, 7)
┌────────┬───────┬─────────────────────────────────┬─────┬───────────┬─────────────┬──────────┐
│ a ┆ b ┆ c ┆ d ┆ e ┆ f ┆ g │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ binary ┆ bool ┆ str ┆ u8 ┆ str ┆ f32 ┆ str │
╞════════╪═══════╪═════════════════════════════════╪═════╪═══════════╪═════════════╪══════════╡
│ b"57" ┆ true ┆ During prevent accept seem sho… ┆ 137 ┆ various ┆ 471.238892 ┆ customer │
│ b"269" ┆ true ┆ Ball always it focus economy b… ┆ 179 ┆ key ┆ 471.238892 ┆ guy │
│ b"250" ┆ false ┆ Admit attack energy always. ┆ 175 ┆ purpose ┆ 1281.769775 ┆ wonder │
│ b"82" ┆ false ┆ Beyond prevent entire staff. ┆ 242 ┆ hair ┆ 904.778687 ┆ around │
│ b"186" ┆ false ┆ Suffer accept letter visit alw… ┆ 134 ┆ magazine ┆ 12.566371 ┆ dream │
│ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … │
│ b"464" ┆ true ┆ Kid prevent avoid quite brothe… ┆ 153 ┆ visit ┆ 879.645935 ┆ anything │
│ b"426" ┆ true ┆ Your sure piece simple always … ┆ 247 ┆ recently ┆ 1055.575073 ┆ laugh │
│ b"403" ┆ false ┆ Difference all machine let cha… ┆ 178 ┆ former ┆ 1061.858276 ┆ always │
│ b"184" ┆ true ┆ Morning carry event tell preve… ┆ 3 ┆ entire ┆ 1432.566284 ┆ hit │
│ b"267" ┆ false ┆ A third traditional ago. ┆ 40 ┆ available ┆ 615.752136 ┆ always │
└────────┴───────┴─────────────────────────────────┴─────┴───────────┴─────────────┴──────────┘