I need to find which of several substrings each string in a column contains, and I am looking for an efficient way to do so.
Slow version:
import polars as pl

def search_text(queries, text):
    # Return every query that occurs as a substring of text
    return [query for query in queries if query in text]

pl_df = pl.DataFrame({
    "Title": ["I am aa", "I am bbob"]
})
queries = ['aa', 'bb']
pl_df = pl_df.with_columns(
    pl.col('Title')
    .map_elements(lambda text: search_text(queries, text))
    .alias('Title_match')
)
print(pl_df)
shape: (2, 2)
┌───────────┬─────────────┐
│ Title     ┆ Title_match │
│ ---       ┆ ---         │
│ str       ┆ list[str]   │
╞═══════════╪═════════════╡
│ I am aa   ┆ ["aa"]      │
│ I am bbob ┆ ["bb"]      │
└───────────┴─────────────┘
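You can use .str.extract_many(), which does the matching natively inside Polars instead of calling a Python function for every row:
pl_df.with_columns(Title_match = pl.col.Title.str.extract_many(queries))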
shape: (2, 2)
┌───────────┬─────────────┐
│ Title     ┆ Title_match │
│ ---       ┆ ---         │
│ str       ┆ list[str]   │
╞═══════════╪═════════════╡
│ I am aa   ┆ ["aa"]      │
│ I am bbob ┆ ["bb"]      │
└───────────┴─────────────┘