
In Polars, is there a better way to only return items within a string if they match items in a list using .is_in?


Is there a better way to only return each pl.element() in a polars array if it matches an item contained within a list?

While it works, it raises the warning "The predicate 'col("").is_in([Series])' in 'when->then->otherwise' is not a valid aggregation and might produce a different number of rows than the group_by operation would. This behavior is experimental and may be subject to change.", which leads me to believe there's probably a more concise/better way:

import polars as pl

terms = ['a', 'z']

(pl.LazyFrame({'a':['x y z']})
   .select(pl.col('a')
             .str.split(' ')
             .list.eval(pl.when(pl.element().is_in(terms))
                          .then(pl.element())
                          .otherwise(None))
             .list.drop_nulls()
             .list.join(' ')
           )
   .fetch()
 )

For posterity's sake, it replaces my previous attempt using .map_elements():

import polars as pl
import re

terms = ['a', 'z']

(pl.LazyFrame({'a':['x y z']})
   .select(pl.col('a')
             # re.findall needs the raw string, so there is no .str.split(' ') here
             .map_elements(lambda x: ' '.join(list(set(re.findall('|'.join(terms), x)))),
                           return_dtype = pl.Utf8)
           )
   .fetch()
 )

Solution

  • @jqurious and @Dean MacGregor were exactly right; I just wanted to post a solution that explains the differences succinctly:

    terms = ['a', 'z']
    
    (pl.LazyFrame({'a':['x a y zebra']})
       .with_columns(only_whole_terms = pl.col('a')
                                          .str.split(' ')
                                          .list.set_intersection(terms),
                     each_term = pl.col('a').str.extract_all('|'.join(terms)),
                    )
       .fetch()
    )
    
    shape: (1, 3)
    ┌─────────────┬──────────────────┬─────────────────┐
    │ a           ┆ only_whole_terms ┆ each_term       │
    │ ---         ┆ ---              ┆ ---             │
    │ str         ┆ list[str]        ┆ list[str]       │
    ╞═════════════╪══════════════════╪═════════════════╡
    │ x a y zebra ┆ ["a"]            ┆ ["a", "z", "a"] │
    └─────────────┴──────────────────┴─────────────────┘
    
    

    Also, this closely related question adds a bit more.