I'm trying to optimize map_elements in polars the way I did in pandas (which might be entirely the wrong way to proceed...). I have a function I don't maintain but that I have to apply to parts of my dataframe; for reproducibility's sake, let's say it is lat_lon_parse from the lat_lon_parser package.
from lat_lon_parser import parse as lat_lon_parse

def test_lat_lon_parse(x):
    try:
        print(f"parsing {x}")
        return lat_lon_parse(x)
    except Exception:
        return None
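As a quick sanity check (a hypothetical REPL snippet; the parsed values match the outputs shown further down):

test_lat_lon_parse("5°N")    # prints "parsing 5°N", returns 5.0
test_lat_lon_parse("oops")   # prints, then returns None because the wrapper swallows the parser's exception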
Let's also say that the dataframe, reflecting real-world data, contains mixed content:
import polars as pl
df = pl.DataFrame({'A':["1", "2", "5°N", "4°S"], "B":[1, 2, 3, 4]})
For efficiency's sake, I don't want to run test_lat_lon_parse on rows 1 and 2 (as I could do a simple .cast() operation and get the same result). What is the state-of-the-art way to proceed?
In pandas, I would have computed an index and applied my function only on that subset of the dataframe. In polars, I see two ways of proceeding:
mask = pl.col('A').str.contains('°')

# way #1
def way_1(df):
    return df.with_columns(
        pl.when(mask)
        .then(pl.col('A').map_elements(test_lat_lon_parse))
        .otherwise(pl.col('A'))
        .cast(pl.Float64)
    )

# way #2
def way_2(df):
    return pl.concat([
        df.filter(mask).with_columns(pl.col('A').map_elements(test_lat_lon_parse).alias('dummy')),
        df.filter(~mask).with_columns(pl.col('A').cast(pl.Float64).alias('dummy'))
    ])
You will see that way #1 applies the function on each row (hence the prints); note that this is not as trivial as it seems, as the applied function may also trigger exceptions when encountering strange data. For instance, you can't cast to pl.Float64 inside the otherwise part of the expression, because it would be applied to the whole series and fail.
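For reference, here is a sketch of the failing variant I mean (not actually run here): the strict cast inside .otherwise() gets evaluated against the full column, including "5°N" and "4°S", and raises.

# fails: the .cast() in .otherwise() is evaluated on the whole column,
# so the strict cast chokes on the "5°N" / "4°S" values
df.with_columns(
    pl.when(mask)
    .then(pl.col('A').map_elements(test_lat_lon_parse))
    .otherwise(pl.col('A').cast(pl.Float64))
)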
Way #2 executes the function only on the subset I specified, but it alters the dataframe's order.
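(A possible workaround, which I haven't benchmarked, would be to tag rows with an index before splitting and restore the order afterwards; a hypothetical sketch, with way_2_ordered being my own illustrative name:)

# variant of way #2 that restores the original row order
# by tagging rows with an index before splitting (extra work, untimed)
def way_2_ordered(df):
    indexed = df.with_row_index("row_nr")
    out = pl.concat([
        indexed.filter(mask).with_columns(pl.col('A').map_elements(test_lat_lon_parse).alias('dummy')),
        indexed.filter(~mask).with_columns(pl.col('A').cast(pl.Float64).alias('dummy')),
    ])
    return out.sort("row_nr").drop("row_nr")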
I used timeit to compare the two approaches. I got these results:
# way #1:
%timeit way_1(df)
>> 443 µs ± 76.2 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
# way #2:
%timeit way_2(df)
>> 1.49 ms ± 462 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
When I increase the dataframe's size, it shows (not surprisingly) that way #1 does not scale as well:
df = pl.DataFrame({'A':["1", "2", "5°N", "4°S"]*10000, "B":[1, 2, 3, 4]*10000})
# way #1:
%timeit way_1(df)
>> 400 ms ± 48.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# way #2:
%timeit way_2(df)
>> 234 ms ± 59.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Is my understanding of how polars works correct? Are there better ways to handle this (and how do they handle scalability)?
.map_elements() defaults to skip_nulls=True, which means in this case you could use .when().then().map_elements() to call the UDF only for the non-null cases.
df.select(
    pl.when(mask)
    .then(pl.col.A)
    .map_elements(test_lat_lon_parse, return_dtype=pl.Float64)
)
# parsing 5°N
# parsing 4°S
shape: (4, 1)
┌──────┐
│ A │
│ --- │
│ f64 │
╞══════╡
│ null │
│ null │
│ 5.0 │
│ -4.0 │
└──────┘
pl.coalesce() (or .fill_null()) can be used to create a single column containing the cast/map_elements results.
df.with_columns(
    pl.coalesce(
        pl.col.A.cast(pl.Float64, strict=False),
        pl.when(mask)
        .then(pl.col.A)
        .map_elements(test_lat_lon_parse, return_dtype=pl.Float64)
    )
)
shape: (4, 2)
┌──────┬─────┐
│ A ┆ B │
│ --- ┆ --- │
│ f64 ┆ i64 │
╞══════╪═════╡
│ 1.0 ┆ 1 │
│ 2.0 ┆ 2 │
│ 5.0 ┆ 3 │
│ -4.0 ┆ 4 │
└──────┴─────┘
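The non-strict cast is what makes the coalesce work: values it cannot parse become null, and those nulls are then filled in from the map_elements branch. As an illustrative snippet, the intermediate cast on its own looks like this:

# the non-strict cast alone: plain numbers convert, "5°N"/"4°S" become null,
# which is exactly where coalesce falls back to the UDF result
df.select(pl.col.A.cast(pl.Float64, strict=False))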