
Make column of all matching substrings from a list that are found within a Polars string column


How do I return a column listing every term (substring) from a list that is found within a string column? I suspect there's a way to do it with pl.any_horizontal() as suggested in these comments, but I can't quite piece it together. My current approach falls back to a Python regex call per row:

import re
import polars as pl

terms = ['a', 'This', 'e']

(pl.DataFrame({'col': 'This is a sentence'})
   .with_columns(matched_terms = pl.col('col').map_elements(lambda x: list(set(re.findall('|'.join(terms), x)))))
)

The column should contain ['a', 'This', 'e'] (in any order, since the matches pass through a set).

EDIT: The winning solution here, .str.extract_all('|'.join(terms)).list.unique(), differs from the winning solution to this closely related question, pl.col('col').str.split(' ').list.set_intersection(terms). The reason is that .set_intersection() only matches whole list elements (full words), so it misses terms that occur as substrings inside a word.
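
To make the distinction in the EDIT concrete, here is a minimal plain-Python sketch of the two matching strategies. It uses only the standard library and mirrors the logic of the two Polars expressions rather than calling Polars itself; the variable names substring_hits and whole_word_hits are illustrative, not from the original post.

```python
import re

terms = ['a', 'This', 'e']
text = 'This is a sentence'

# Substring matching: same idea as
# .str.extract_all('|'.join(terms)).list.unique()
substring_hits = set(re.findall('|'.join(terms), text))

# Whole-word matching: same idea as
# .str.split(' ').list.set_intersection(terms)
whole_word_hits = set(text.split(' ')) & set(terms)

print(sorted(substring_hits))   # ['This', 'a', 'e']
print(sorted(whole_word_hits))  # ['This', 'a']
```

The term 'e' only ever appears inside the word 'sentence', so the whole-word approach never sees it.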


Solution

  • I've included several related term-matching columns for comparison, but the each_term column, built with pl.col('a').str.extract_all('|'.join(terms)), was the best solution for me.

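    import polars as pl
    
    # Show up to 4 list elements per table cell
    pl.Config.set_fmt_table_cell_list_len(4)
    
    terms = ['A', 'u', 'bug', 'g']
    
    (pl.DataFrame({'a': 'A bug in a rug.'})
     .select(
             # Three ways to test whether any term occurs in the string
             has_term = pl.col('a').str.contains_any(terms),
             has_term2 = pl.col('a').str.contains('|'.join(terms)),
             has_term3 = pl.any_horizontal(pl.col("a").str.contains(t) for t in terms),
             
             # Every non-overlapping regex match, including substrings of words
             each_term = pl.col('a').str.extract_all('|'.join(terms)),
             
             # Only terms that equal a whole space-delimited word
             whole_terms = pl.col('a').str.split(' ').list.set_intersection(terms),
             # Number of regex matches
             n_matched_terms = pl.col('a').str.count_matches('|'.join(terms)),
            )
    )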
    
    shape: (1, 6)
    ┌──────────┬───────────┬───────────┬────────────────────────┬──────────────┬─────────────────┐
    │ has_term ┆ has_term2 ┆ has_term3 ┆ each_term              ┆ whole_terms  ┆ n_matched_terms │
    │ ---      ┆ ---       ┆ ---       ┆ ---                    ┆ ---          ┆ ---             │
    │ bool     ┆ bool      ┆ bool      ┆ list[str]              ┆ list[str]    ┆ u32             │
    ╞══════════╪═══════════╪═══════════╪════════════════════════╪══════════════╪═════════════════╡
    │ true     ┆ true      ┆ true      ┆ ["A", "bug", "u", "g"] ┆ ["A", "bug"] ┆ 4               │
    └──────────┴───────────┴───────────┴────────────────────────┴──────────────┴─────────────────┘
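
Two caveats of the '|'.join(terms) pattern may be worth keeping in mind: terms containing regex metacharacters are interpreted as patterns, and an alternation only reports non-overlapping matches. The sketch below illustrates both with Python's standard re module. build_pattern is a hypothetical helper, not part of Polars, and whether the escaped pattern is accepted unchanged by Polars' own regex engine is an assumption to verify.

```python
import re

def build_pattern(terms):
    """Hypothetical helper: escape each term and try longer terms first."""
    ordered = sorted(terms, key=len, reverse=True)
    return '|'.join(re.escape(t) for t in ordered)

# Same data as the answer above: the result matches the each_term column.
terms = ['A', 'u', 'bug', 'g']
matches = re.findall(build_pattern(terms), 'A bug in a rug.')
print(matches)  # ['A', 'bug', 'u', 'g']

# Caveat 1: an unescaped '.' matches any character.
dotted = ['a.b', 'c']
unescaped_hits = re.findall('|'.join(dotted), 'axb a.b c')
escaped_hits = re.findall(build_pattern(dotted), 'axb a.b c')
print(unescaped_hits)  # ['axb', 'a.b', 'c']
print(escaped_hits)    # ['a.b', 'c']

# Caveat 2: matches do not overlap, so 'ug' inside 'bug' is not reported.
overlapping = ['bug', 'ug']
regex_hits = re.findall(build_pattern(overlapping), 'bug')
contained = [t for t in overlapping if t in 'bug']  # per-term check
print(regex_hits)  # ['bug']
print(contained)   # ['bug', 'ug']
```

If every contained term must be reported even when terms overlap, a per-term containment check (in the spirit of the has_term3 column) is the safer route.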