polars DataFrame - search strings from list


For each string in a column, I need to find which substrings from a list of queries it contains. I am looking for an efficient way to do this.

Slow version:

import polars as pl

def search_text(queries, text):
    return [query for query in queries if query in text]


pl_df = pl.DataFrame( {
        "Title": ["I am aa", "I am bbob"]
    })

queries = ['aa', 'bb']

pl_df = pl_df.with_columns(pl.col('Title').map_elements(lambda text: search_text(queries, text)).alias('Title_match'))

print(pl_df)
shape: (2, 2)
┌───────────┬─────────────┐
│ Title     ┆ Title_match │
│ ---       ┆ ---         │
│ str       ┆ list[str]   │
╞═══════════╪═════════════╡
│ I am aa   ┆ ["aa"]      │
│ I am bbob ┆ ["bb"]      │
└───────────┴─────────────┘

Solution

  • A common approach is to create a single regex delimited by |.

    Sort the queries by length, descending, so that a longer query is tried before any shorter query it contains, and run them through .str.escape_regex() so that metacharacters such as . are matched literally instead of producing false matches.

    queries = ["aa", "bb", "bo"]
    
    pattern = (
       pl.Series(sorted(queries, key=len, reverse=True))
         .str.escape_regex()
         .str.join("|")
    )
    
    pattern.item()
    # 'aa|bb|bo'
    
    # `pl_df` is the frame defined in the question
    pl_df.with_columns(
       pl.col("Title").str.extract_all(pattern).alias("Title_match")
    )
    
    shape: (2, 2)
    ┌───────────┬─────────────┐
    │ Title     ┆ Title_match │
    │ ---       ┆ ---         │
    │ str       ┆ list[str]   │
    ╞═══════════╪═════════════╡
    │ I am aa   ┆ ["aa"]      │
    │ I am bbob ┆ ["bb"]      │ # NOTE: `bo` is not matched (regex matches do not overlap)
    └───────────┴─────────────┘
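
To illustrate why the sorting and escaping matter, here is a minimal sketch using Python's standard `re` module rather than polars. The queries and the text are made up for the example. As far as I know, the Rust regex engine behind polars also picks the first alternative that matches at a position, so the same reasoning should carry over, but treat that as an assumption and check it on your own data.

```python
import re

# Hypothetical queries: "b" is a prefix of "bb", and "a.c" contains a metacharacter.
queries = ["b", "bb", "a.c"]
text = "bb axc"

# Naive pattern: original order, no escaping.
naive = "|".join(queries)
naive_hits = re.findall(naive, text)
# "b" wins over "bb" at the same position, and "." matches the "x" in "axc".
print(naive_hits)

# Longest first, with every query escaped so it is matched literally.
safe = "|".join(re.escape(q) for q in sorted(queries, key=len, reverse=True))
safe_hits = re.findall(safe, text)
# Only the literal query "bb" is present in the text.
print(safe_hits)
```

The naive pattern reports two separate `b` matches plus the false match `axc`, while the sorted and escaped pattern reports a single `bb`.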
    

    Alternatively, there is .str.extract_many() for non-regex (Aho-Corasick) matching.

    It also supports overlapping matches, if needed.

    # `pl_df` is the frame defined in the question
    pl_df.with_columns(
       pl.col("Title").str.extract_many(queries, overlapping=True)
         .alias("Title_match")
    )
    
    shape: (2, 2)
    ┌───────────┬──────────────┐
    │ Title     ┆ Title_match  │
    │ ---       ┆ ---          │
    │ str       ┆ list[str]    │
    ╞═══════════╪══════════════╡
    │ I am aa   ┆ ["aa"]       │
    │ I am bbob ┆ ["bb", "bo"] │
    └───────────┴──────────────┘
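
To make the difference between overlapping and non-overlapping matching concrete, here is a plain-Python sketch of literal multi-pattern search. The helper `find_matches` is hypothetical and is not how polars implements `.str.extract_many()` (that uses Aho-Corasick and is much faster); it only mimics the two modes on the example data, and the ordering of results in polars may differ in other cases.

```python
def find_matches(text, queries, overlapping=False):
    """Naive literal search for several queries (illustration only)."""
    queries = [q for q in queries if q]  # skip empty queries to avoid looping forever
    found = []
    pos = 0
    while pos < len(text):
        # All queries that start exactly at this position, in list order.
        hits = [q for q in queries if text.startswith(q, pos)]
        if not hits:
            pos += 1
        elif overlapping:
            # Report every hit and move on by a single character,
            # so a later match may share characters with this one.
            found.extend(hits)
            pos += 1
        else:
            # Report one hit and jump past it, so matches never share characters.
            found.append(hits[0])
            pos += len(hits[0])
    return found


queries = ["aa", "bb", "bo"]

for title in ["I am aa", "I am bbob"]:
    print(title, find_matches(title, queries), find_matches(title, queries, overlapping=True))
```

For "I am bbob", the non-overlapping mode consumes `bb` and then cannot see `bo`, which shares the second `b`; the overlapping mode reports both.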