I have a polars DataFrame with a column of lists of strings. Those lists contain empty strings that I need to filter out. I used the following method:
df_main = df_main.with_columns(
    df_main['ColumnList'].list.eval(
        pl.element().filter(pl.element() != "")
    ).alias('ColumnList')
)
However, it takes a long time (around 12 seconds). Another slow step is stripping the spaces from each element in the lists (around 10 seconds):
df_main = df_main.with_columns(
    df_main['ColumnList'].list.eval(
        pl.element().str.strip_chars()
    ).alias('ColumnList')
)
I have also tried "map_elements" with a custom function to perform both operations at once, but it took more than a minute:
def FilterEmptyAndTrim(inList):
    lst1 = list(filter(lambda x: x != '', inList))  # filter out empty items in the list
    return [s.strip() for s in lst1]                # trim the useful items in the list

df_main = df_main.with_columns(
    df_main['ColumnList'].map_elements(FilterEmptyAndTrim).alias('ColumnList')
)
I used the exact same custom function with pandas' transform method, and it took only 4 seconds.
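(For reference, the pandas comparison was along these lines; this is my reconstruction using apply on the column, which behaves the same as transform here:)

```python
import pandas as pd

def FilterEmptyAndTrim(inList):
    lst1 = [x for x in inList if x != '']  # filter out empty items
    return [s.strip() for s in lst1]       # trim the useful items

# Hypothetical reconstruction of the pandas comparison.
df_pd = pd.DataFrame({"ColumnList": [['a ', 'b', '', ' c', 'd']]})
df_pd["ColumnList"] = df_pd["ColumnList"].apply(FilterEmptyAndTrim)
```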
Here is some sample data shaped like mine:
import polars as pl
df_main = pl.DataFrame({
    "ColumnList": [
        ['a ', 'b', '', ' c', 'd'],
        ['a ', '', ' b', '', 'd'],
        ['a ', ' b', '', ' c', 'd'],
        ['a ', 'b', ' b', '', ' d '],
    ]
})
df_main
Any way to improve the performance?
Thanks in advance
You can use set_difference to filter out the empty strings and then either use list.eval or explode/implode with .over to strip the leading/trailing whitespace. The explode/implode version is more verbose but only slightly slower on this tiny test case.
explode/implode
(
    df_main
    .with_row_index('i')
    .select(
        pl.col("ColumnList").list.set_difference(pl.lit(['']))
        .explode().str.strip_chars().implode().over('i')
    )
    .drop('i')
)
list.eval
(
    df_main
    .select(
        pl.col("ColumnList").list.set_difference(pl.lit(['']))
        .list.eval(pl.element().str.strip_chars())
    )
)
Explode is your friend.
For whatever reason, it turns out that list.eval is slow. Try exploding the list alongside a row-index column instead. Even though this is many more lines of code and intuition says it ought to be slower, avoiding list.eval is often faster:
(
    df_main
    .with_row_index('i')
    .explode('ColumnList')
    .filter(pl.col('ColumnList') != '')
    .with_columns(pl.col('ColumnList').str.strip_chars())
    .group_by('i', maintain_order=True)
    .agg(pl.col('ColumnList'))
    .drop('i')
)
shape: (4, 1)
┌──────────────────────┐
│ ColumnList │
│ --- │
│ list[str] │
╞══════════════════════╡
│ ["a", "b", "c", "d"] │
│ ["a", "b", "d"] │
│ ["a", "b", "c", "d"] │
│ ["a", "b", "b", "d"] │
└──────────────────────┘