
Python polars: apply a custom function efficiently on parts of a dataframe


I'm trying to optimize map_elements in polars the way I did in pandas (which might be entirely the wrong way to proceed...).

I have a function that I don't maintain but that I have to apply to parts of my dataframe; for reproducibility's sake, let's say it is lat_lon_parse from the lat_lon_parser package.

from lat_lon_parser import parse as lat_lon_parse

def test_lat_lon_parse(x):
    try:
        print(f"parsing {x}")
        return lat_lon_parse(x)
    except Exception:
        return None

Let's also say that your dataframe, reflecting real data, contains mixed values.

import polars as pl
df = pl.DataFrame({'A':["1", "2", "5°N", "4°S"], "B":[1, 2, 3, 4]})

For efficiency's sake, I don't want to run test_lat_lon_parse on rows 1 and 2 (since a simple .cast() would give the same result there). What is the state-of-the-art way to proceed?

In pandas, I would have computed an index and applied my function on the subset of the dataframe only.
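For illustration, a minimal sketch of that pandas approach (pdf and idx are illustrative names; the mask mirrors the polars one below):

import pandas as pd

pdf = pd.DataFrame({'A': ['1', '2', '5°N', '4°S'], 'B': [1, 2, 3, 4]})

# Boolean index of the rows that actually need the UDF.
idx = pdf['A'].str.contains('°')

# Apply the UDF only on that subset, then cast the whole column.
pdf.loc[idx, 'A'] = pdf.loc[idx, 'A'].apply(test_lat_lon_parse)
pdf['A'] = pdf['A'].astype(float)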

In polars, I see two ways of proceeding:

mask = pl.col('A').str.contains('°')

# way #1
def way_1(df):
    return df.with_columns(
        pl.when(mask)
        .then(pl.col('A').map_elements(test_lat_lon_parse))
        .otherwise(pl.col('A'))
        .cast(pl.Float64)
    )

# way #2
def way_2(df):
    return pl.concat([
        df.filter(mask).with_columns(pl.col('A').map_elements(test_lat_lon_parse).alias('dummy')),
        df.filter(~mask).with_columns(pl.col('A').cast(pl.Float64).alias('dummy')),
    ])

You will see that way #1 applies the function to every row (hence the prints). Note that this is not as trivial as it seems, since the applied function may itself raise exceptions on strange data. For instance, you can't put the cast to pl.Float64 inside the .otherwise() branch, because the cast would be evaluated on the whole series and fail on the non-numeric strings.
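For reference, here is the failing variant I mean (a sketch; way_1_broken is just an illustrative name):

def way_1_broken(df):
    return df.with_columns(
        pl.when(mask)
        .then(pl.col('A').map_elements(test_lat_lon_parse))
        # The cast below runs on the whole series, including "5°N" and
        # "4°S", so the strict cast raises before the mask is applied.
        .otherwise(pl.col('A').cast(pl.Float64))
    )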

Way #2 executes the function only on the subset I specified, but it alters the dataframe's row order.
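If preserving order matters, one possible fix (a sketch, assuming a polars version that provides .with_row_index(); 'idx' and way_2_ordered are illustrative names) is to tag rows with an index before splitting and sort it back after the concat:

def way_2_ordered(df):
    indexed = df.with_row_index('idx')
    return pl.concat([
        indexed.filter(mask).with_columns(
            pl.col('A').map_elements(test_lat_lon_parse, return_dtype=pl.Float64).alias('dummy')),
        indexed.filter(~mask).with_columns(
            pl.col('A').cast(pl.Float64).alias('dummy')),
    ]).sort('idx').drop('idx')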

I used %timeit to compare the two approaches and got these results:

# way #1: 
%timeit way_1(df)
>> 443 µs ± 76.2 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

# way #2:
%timeit way_2(df)
>> 1.49 ms ± 462 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

When I increase the dataframe's size, it turns out (unsurprisingly) that way #1 scales worse:

df = pl.DataFrame({'A':["1", "2", "5°N", "4°S"]*10000, "B":[1, 2, 3, 4]*10000})

# way #1: 
%timeit way_1(df)
>> 400 ms ± 48.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# way #2:
%timeit way_2(df)
>> 234 ms ± 59.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Is my understanding of how polars proceeds correct? Are there better ways to handle this (and how do they scale)?


Solution

  • .map_elements() defaults to skip_nulls=True, which means in this case you can use .when().then().map_elements() to call the UDF only for the non-null cases.

    df.select(
       pl.when(mask)
         .then(pl.col.A)
         .map_elements(test_lat_lon_parse, return_dtype=pl.Float64)
    )
    
    # parsing 5°N
    # parsing 4°S
    
    shape: (4, 1)
    ┌──────┐
    │ A    │
    │ ---  │
    │ f64  │
    ╞══════╡
    │ null │
    │ null │
    │ 5.0  │
    │ -4.0 │
    └──────┘
    

    pl.coalesce() (or .fill_null()) can be used to combine the cast and map_elements results into a single column.

    df.with_columns(
       pl.coalesce(
           pl.col.A.cast(pl.Float64, strict=False),
           pl.when(mask)
             .then(pl.col.A)
             .map_elements(test_lat_lon_parse, return_dtype=pl.Float64)
       )
    )
    
    shape: (4, 2)
    ┌──────┬─────┐
    │ A    ┆ B   │
    │ ---  ┆ --- │
    │ f64  ┆ i64 │
    ╞══════╪═════╡
    │ 1.0  ┆ 1   │
    │ 2.0  ┆ 2   │
    │ 5.0  ┆ 3   │
    │ -4.0 ┆ 4   │
    └──────┴─────┘
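
    As noted above, .fill_null() can express the same thing; a sketch of that variant (same mask and UDF as before):

    df.with_columns(
       # Non-numeric strings become null under the non-strict cast,
       # then get filled with the UDF's parsed values.
       pl.col.A.cast(pl.Float64, strict=False)
         .fill_null(
             pl.when(mask)
               .then(pl.col.A)
               .map_elements(test_lat_lon_parse, return_dtype=pl.Float64)
         )
    )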