polars DataFrame - search strings from list

I need to find which substrings from a list occur in each string of a column. I am looking for an efficient way to do so.

Slow version:

import polars as pl

def search_text(queries, text):
    return [query for query in queries if query in text]


pl_df = pl.DataFrame( {
        "Title": ["I am aa", "I am bbob"]
    })

queries = ['aa', 'bb']

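# Run the Python function once per row (slow) and declare the result type explicitly.
pl_df = pl_df.with_columns(
    pl.col('Title')
    .map_elements(lambda text: search_text(queries, text), return_dtype=pl.List(pl.String))
    .alias('Title_match')
)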

print(pl_df)

shape: (2, 2)
┌───────────┬─────────────┐
│ Title     ┆ Title_match │
│ ---       ┆ ---         │
│ str       ┆ list[str]   │
╞═══════════╪═════════════╡
│ I am aa   ┆ ["aa"]      │
│ I am bbob ┆ ["bb"]      │
└───────────┴─────────────┘

Solution

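A common approach is to combine all the queries into a single regex, joined with |.

Sort the queries by length in descending order first: the regex engine tries the alternatives from left to right and takes the first one that matches, so a shorter query listed before a longer one that starts with it would always win. Also run them through .str.escape_regex() so that each query is matched literally, which prevents false matches (e.g. if they could contain metacharacters such as .)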

queries = ["aa", "bb", "bo"]

pattern = (
   pl.Series(sorted(queries, key=len, reverse=True))
     .str.escape_regex()
     .str.join("|")
)

pattern.item()
# 'aa|bb|bo'

# pl_df is the DataFrame from the question
pl_df.with_columns(
    pl.col("Title").str.extract_all(pattern).alias("Title_match")
)

shape: (2, 2)
┌───────────┬─────────────┐
│ Title     ┆ Title_match │
│ ---       ┆ ---         │
│ str       ┆ list[str]   │
╞═══════════╪═════════════╡
│ I am aa   ┆ ["aa"]      │
│ I am bbob ┆ ["bb"]      │ # NOTE: `bo` is not matched
└───────────┴─────────────┘
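
To see why the ordering and the escaping matter, here is a minimal sketch using Python's standard re module instead of Polars. The helper name build_pattern is made up for this illustration, and re.escape stands in for .str.escape_regex(); the two regex engines are different, but both try alternatives from left to right.

```python
import re

def build_pattern(queries):
    # Longest queries first: alternation takes the first alternative that matches.
    ordered = sorted(queries, key=len, reverse=True)
    # Escape each query so that characters like "." are matched literally.
    return "|".join(re.escape(query) for query in ordered)

# Order matters: the shorter "bb" shadows "bbo" when it is listed first.
unsorted_hits = re.findall("bb|bbo", "I am bbob")
sorted_hits = re.findall(build_pattern(["bb", "bbo"]), "I am bbob")

# Escaping matters: an unescaped "." matches any character.
unescaped_hits = re.findall("a.c", "abc")
escaped_hits = re.findall(build_pattern(["a.c"]), "abc")

print(unsorted_hits, sorted_hits, unescaped_hits, escaped_hits)
```

The unsorted pattern only ever reports bb, while the sorted one reports bbo; the unescaped a.c wrongly matches abc, while the escaped version finds nothing.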

Alternatively, there is .str.extract_many(), which matches the queries as literal strings (Aho-Corasick) rather than as a regex, so no escaping is needed.

It also supports overlapping matches via overlapping=True, if needed.

# pl_df is the DataFrame from the question
pl_df.with_columns(
    pl.col("Title").str.extract_many(queries, overlapping=True)
      .alias("Title_match")
)

shape: (2, 2)
┌───────────┬──────────────┐
│ Title     ┆ Title_match  │
│ ---       ┆ ---          │
│ str       ┆ list[str]    │
╞═══════════╪══════════════╡
│ I am aa   ┆ ["aa"]       │
│ I am bbob ┆ ["bb", "bo"] │
└───────────┴──────────────┘
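
For intuition about what "overlapping" means here, the following is a rough pure-Python sketch. It is not how Polars implements it (this is a naive scan, not Aho-Corasick), and the function name find_overlapping is made up for the illustration. It compares a scan that reports every occurrence with re.findall, which resumes searching after the end of each match.

```python
import re

def find_overlapping(queries, text):
    """Return every occurrence of every query, ordered by start position."""
    matches = []
    for start in range(len(text)):
        for query in queries:
            # Skip empty queries, which would match at every position.
            if query and text.startswith(query, start):
                matches.append(query)
    return matches

queries = ["aa", "bb", "bo"]

# Overlapping scan: "bb" starts at index 5 and "bo" at index 6, so both are found.
overlapping_hits = find_overlapping(queries, "I am bbob")

# Regex scan: after consuming "bb", the search resumes at "ob", so "bo" is missed.
regex_hits = re.findall("|".join(re.escape(q) for q in queries), "I am bbob")

print(overlapping_hits, regex_hits)
```

The naive scan costs roughly len(text) * len(queries) comparisons per string, which is why an Aho-Corasick matcher is the better fit for long query lists.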