
Slow performance of Python Polars in applying pl.element().filter


I have a Polars DataFrame with a column of lists of strings. Those lists contain empty elements that I need to filter out. I used the following method:

df_main = df_main.with_columns(
    df_main['ColumnList'].list.eval(
        pl.element().filter(pl.element() != "")
    ).alias('ColumnList')
)

However, it takes a long time (around 12 seconds). Another slow step is stripping the spaces from each element in the lists (around 10 seconds):

df_main = df_main.with_columns(
    df_main['ColumnList'].list.eval(pl.element().str.strip_chars()).alias('ColumnList')
)

I have also tried "map_elements" with a custom function to perform both required operations, but it took more than a minute:

def FilterEmptyAndTrim(inList):
    lst1 = list(filter(lambda x: x != '', inList))  # filter out the empty items in the list
    return [s.strip() for s in lst1]  # trim the useful items in the list

df_main = df_main.with_columns(
    df_main['ColumnList'].map_elements(FilterEmptyAndTrim).alias('ColumnList'))

I have used the exact same custom function with the pandas transform method, and it took only 4 seconds.
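The pandas code is not shown here, but a rough reconstruction of that comparison, reusing the FilterEmptyAndTrim function above on an assumed small frame, would look like this:

import pandas as pd

# Assumed pandas frame with the same kind of list column (the real one is much larger).
pdf = pd.DataFrame({"ColumnList": [['a ', 'b', '', ' c', 'd'],
                                   ['a ', '', ' b', '', 'd']]})

# Apply the same custom function element-wise via transform.
pdf["ColumnList"] = pdf["ColumnList"].transform(FilterEmptyAndTrim)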

Here is some sample data similar to mine:

import polars as pl
df_main = pl.DataFrame({
    "ColumnList": [
        ['a ', 'b', '', ' c', 'd'],
        ['a ', '', ' b', '', 'd'],
        ['a ', ' b', '', ' c', 'd'],
        ['a ', 'b', ' b', '', ' d '],
    ]
})
df_main
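
Four rows will not reproduce the slowdown, so for timing purposes a scaled-up copy of the sample can stand in for my real data (the repetition factor below is arbitrary):

# Repeat the tiny sample to get a frame large enough to time the approaches.
df_big = pl.concat([df_main] * 250_000)  # ~1,000,000 rows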

Any way to improve the performance?

Thanks in advance


Solution

  • Update

    You can use set_difference to filter out the empty strings and then either list.eval or explode/implode with .over to strip the leading/trailing whitespace. The explode/implode version is more verbose and slightly slower on this tiny test case.

    explode/implode
    
    (
        df_main
        .with_row_index('i')
        .select(
            pl.col("ColumnList").list.set_difference(pl.lit(['']))
            .explode().str.strip_chars().implode().over('i')
        )
    )
    
    list.eval
    (
        df_main
        .select(
            pl.col("ColumnList").list.set_difference(pl.lit(['']))
            .list.eval(pl.element().str.strip_chars())
        )
    )
    

    Old answer

    Explode is your friend.

    For whatever reason, it turns out that list.eval is slow. Try exploding the list alongside a row index column instead. Even though this takes many more lines of code and intuition says it ought to be slower, avoiding list.eval is often faster.

    (
        df_main
        .with_row_count('i')
        .explode('ColumnList')
        .filter(pl.col('ColumnList') != '')
        .with_columns(pl.col('ColumnList').str.strip_chars())
        .group_by('i', maintain_order=True)
        .agg(pl.col('ColumnList'))
        .drop('i')
    )
    shape: (4, 1)
    ┌──────────────────────┐
    │ ColumnList           │
    │ ---                  │
    │ list[str]            │
    ╞══════════════════════╡
    │ ["a", "b", "c", "d"] │
    │ ["a", "b", "d"]      │
    │ ["a", "b", "c", "d"] │
    │ ["a", "b", "c", "d"] │
    └──────────────────────┘
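
    If the real frame is large, the same pipeline can also be run through the lazy API so the whole query is optimized before it is materialized. A minimal sketch, assuming the newer with_row_index API:

    (
        df_main.lazy()
        .with_row_index('i')
        .explode('ColumnList')
        .filter(pl.col('ColumnList') != '')
        .with_columns(pl.col('ColumnList').str.strip_chars())
        .group_by('i', maintain_order=True)
        .agg(pl.col('ColumnList'))
        .drop('i')
        .collect()
    )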