I need to find which of several substrings each string in a column contains, and I am looking for an efficient way to do so.
Slow version:
import polars as pl

def search_text(queries, text):
    # Return every query that occurs somewhere in the text
    return [query for query in queries if query in text]

pl_df = pl.DataFrame({
    "Title": ["I am aa", "I am bbob"]
})
queries = ['aa', 'bb']
pl_df = pl_df.with_columns(
    pl.col('Title')
    .map_elements(lambda text: search_text(queries, text))
    .alias('Title_match')
)
print(pl_df)
shape: (2, 2)
┌───────────┬─────────────┐
│ Title     ┆ Title_match │
│ ---       ┆ ---         │
│ str       ┆ list[str]   │
╞═══════════╪═════════════╡
│ I am aa   ┆ ["aa"]      │
│ I am bbob ┆ ["bb"]      │
└───────────┴─────────────┘
A common approach is to create a single regex with the queries delimited by |.
Sort your queries by length descending, so that a longer query is tried before a shorter one that is its prefix, and run them through .str.escape_regex() to prevent possible false matches (e.g. if they could contain metacharacters such as .).
queries = ["aa", "bb", "bo"]
pattern = (
    pl.Series(sorted(queries, key=len, reverse=True))
    .str.escape_regex()
    .str.join("|")
)
pattern.item()
# 'aa|bb|bo'

# pl_df is the DataFrame from the question
pl_df.with_columns(
    pl.col("Title").str.extract_all(pattern).alias("Title_match")
)
shape: (2, 2)
┌───────────┬─────────────┐
│ Title     ┆ Title_match │
│ ---       ┆ ---         │
│ str       ┆ list[str]   │
╞═══════════╪═════════════╡
│ I am aa   ┆ ["aa"]      │
│ I am bbob ┆ ["bb"]      │ # NOTE: `bo` is not matched
└───────────┴─────────────┘
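To see why the sorting and escaping steps matter, here is a small sketch using Python's standard `re` module rather than Polars. Polars uses a different regex engine, but as far as I know it picks the first alternative that matches in the same way, so the effect should carry over. `build_pattern` is a hypothetical helper written only for this illustration.

```python
import re

def build_pattern(queries, sort_by_length=True, escape=True):
    """Join literal queries into a single alternation pattern (illustrative helper)."""
    if sort_by_length:
        # Longest first, so "aa" is tried before its prefix "a"
        queries = sorted(queries, key=len, reverse=True)
    if escape:
        # Treat metacharacters such as "." as literal text
        queries = [re.escape(q) for q in queries]
    return "|".join(queries)

text = "I am aa"

# Unsorted: the shorter alternative "a" wins, so "aa" is never reported
unsorted_hits = re.findall(build_pattern(["a", "aa"], sort_by_length=False), text)
# Sorted: "aa" is tried first and is found where it occurs
sorted_hits = re.findall(build_pattern(["a", "aa"]), text)

# Unescaped: "." acts as a wildcard and "axb" becomes a false match
unescaped_hits = re.findall(build_pattern(["a.b"], escape=False), "axb a.b")
# Escaped: only the literal "a.b" matches
escaped_hits = re.findall(build_pattern(["a.b"]), "axb a.b")

print(unsorted_hits)   # ['a', 'a', 'a']
print(sorted_hits)     # ['a', 'aa']
print(unescaped_hits)  # ['axb', 'a.b']
print(escaped_hits)    # ['a.b']
```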
Alternatively, there is .str.extract_many() for non-regex (Aho-Corasick) matching.
It also supports overlapping matches, if needed.
# pl_df is the DataFrame from the question
pl_df.with_columns(
    pl.col("Title").str.extract_many(queries, overlapping=True)
    .alias("Title_match")
)
shape: (2, 2)
┌───────────┬──────────────┐
│ Title     ┆ Title_match  │
│ ---       ┆ ---          │
│ str       ┆ list[str]    │
╞═══════════╪══════════════╡
│ I am aa   ┆ ["aa"]       │
│ I am bbob ┆ ["bb", "bo"] │
└───────────┴──────────────┘
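If it helps to see what "overlapping" means here, below is a plain-Python sketch of the semantics. It is not how Polars implements `extract_many` (that is Aho-Corasick, per the above), it is much slower, and `find_overlapping` is a hypothetical helper. I order hits by start position, which may not match Polars' ordering in every case.

```python
def find_overlapping(queries, text):
    """Return every occurrence of every query, overlaps included, ordered by start index."""
    hits = []
    for query in queries:
        start = text.find(query)
        while start != -1:
            hits.append((start, query))
            # Advance by one character only, so overlapping occurrences are kept
            start = text.find(query, start + 1)
    return [query for _, query in sorted(hits)]

queries = ["aa", "bb", "bo"]
print(find_overlapping(queries, "I am aa"))    # ['aa']
print(find_overlapping(queries, "I am bbob"))  # ['bb', 'bo']
```

In "I am bbob", `bb` and `bo` share the middle `b`, which is why a non-overlapping search reports only `bb`.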