
Slow performance of Python Polars in applying pl.element().filter


I have a Polars DataFrame with a column of lists of strings. Those lists contain empty elements that I need to filter out. I used the following method:

df_main = df_main.with_columns(
    df_main['ColumnList'].list.eval(
        pl.element().filter(pl.element() != "")
    ).alias('ColumnList')
)

However, it takes a long time (around 12 seconds). Another slow step is stripping the spaces from each element in the lists (around 10 seconds):

df_main = df_main.with_columns(
    df_main['ColumnList'].list.eval(pl.element().str.strip_chars()).alias('ColumnList')
)

I have also tried "map_elements" with a custom function to perform both required operations, but it took more than a minute:

def FilterEmptyAndTrim(inList):
    lst1 = list(filter(lambda x: x != '', inList))  # filter out the empty items in the list
    return [s.strip() for s in lst1]  # trim the useful items in the list

df_main = df_main.with_columns(
    df_main['ColumnList'].map_elements(FilterEmptyAndTrim).alias('ColumnList'))

I have used the exact same custom function with the pandas transform method, and it took only 4 seconds.
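The pandas code is not shown here, but a rough reconstruction of that comparison, reusing the FilterEmptyAndTrim function above on an assumed small frame, would look like this:

import pandas as pd

# Assumed pandas frame with the same kind of list column (the real one is much larger).
pdf = pd.DataFrame({"ColumnList": [['a ', 'b', '', ' c', 'd'],
                                   ['a ', '', ' b', '', 'd']]})

# Apply the same custom function element-wise via transform.
pdf["ColumnList"] = pdf["ColumnList"].transform(FilterEmptyAndTrim)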

Here is some sample data similar to mine:

import polars as pl
df_main = pl.DataFrame({
    "ColumnList": [
        ['a ', 'b', '', ' c', 'd'],
        ['a ', '', ' b', '', 'd'],
        ['a ', ' b', '', ' c', 'd'],
        ['a ', 'b', ' b', '', ' d '],
    ]
})
df_main
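
Four rows will not reproduce the slowdown, so for timing purposes a scaled-up copy of the sample can stand in for my real data (the repetition factor below is arbitrary):

# Repeat the tiny sample to get a frame large enough to time the approaches.
df_big = pl.concat([df_main] * 250_000)  # ~1,000,000 rows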

Any way to improve the performance?

Thanks in advance


Solution

  • Update

    You can use set_difference to filter out the empty strings and then either list.eval or explode/implode with .over to strip the leading/trailing whitespace. The explode/implode version is more verbose and slightly slower on this tiny test case.

    explode/implode
    
    (
        df_main
        .with_row_index('i')
        .select(
            pl.col("ColumnList").list.set_difference(pl.lit(['']))
            .explode().str.strip_chars().implode().over('i')
        )
    )
    
    list.eval
    (
        df_main
        .select(
            pl.col("ColumnList").list.set_difference(pl.lit(['']))
            .list.eval(pl.element().str.strip_chars())
        )
    )
    

    Old answer

    Explode is your friend.

    For whatever reason, it turns out that list.eval is slow. Try exploding the list alongside a row index column instead. Even though this takes many more lines of code and intuition says it ought to be slower, avoiding list.eval is often faster.

    (
        df_main
        .with_row_count('i')
        .explode('ColumnList')
        .filter(pl.col('ColumnList') != '')
        .with_columns(pl.col('ColumnList').str.strip_chars())
        .group_by('i', maintain_order=True)
        .agg(pl.col('ColumnList'))
        .drop('i')
    )
    shape: (4, 1)
    ┌──────────────────────┐
    │ ColumnList           │
    │ ---                  │
    │ list[str]            │
    ╞══════════════════════╡
    │ ["a", "b", "c", "d"] │
    │ ["a", "b", "d"]      │
    │ ["a", "b", "c", "d"] │
    │ ["a", "b", "c", "d"] │
    └──────────────────────┘
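
    If the real frame is large, the same pipeline can also be run through the lazy API so the whole query is optimized before it is materialized. A minimal sketch, assuming the newer with_row_index API:

    (
        df_main.lazy()
        .with_row_index('i')
        .explode('ColumnList')
        .filter(pl.col('ColumnList') != '')
        .with_columns(pl.col('ColumnList').str.strip_chars())
        .group_by('i', maintain_order=True)
        .agg(pl.col('ColumnList'))
        .drop('i')
        .collect()
    )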