Tags: python-polars, rust-polars

Fast apply of a function to a Polars DataFrame


What are the fastest ways to apply functions to Polars DataFrames - pl.DataFrame or pl.LazyFrame? This question is piggy-backing off Apply Function to all columns of a Polars-DataFrame

I am trying to concatenate all columns and hash the value using hashlib from the Python standard library. The function I am using is below:

import hashlib
import os

def hash_row(row):
    # Note: PYTHONHASHSEED only affects Python's built-in hash() and is read
    # at interpreter startup; sha256 itself is deterministic regardless.
    os.environ['PYTHONHASHSEED'] = "0"
    row = str(row).encode('utf-8')
    return hashlib.sha256(row).hexdigest()
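
For reference, calling it on an example string (a hypothetical row rendering) returns a 64-character sha256 hex digest:

print(hash_row("1|10.0|true|2020-01-01"))  # prints a 64-character hex string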

However, since this function requires a string as input, it has to be applied to every value in a pl.Series. Working with a small amount of data should be okay, but when we have closer to 100m rows this becomes very problematic. The question for this thread is: how can we apply such a function in the most performant way across an entire Polars Series?

Pandas

Pandas offers a few options for creating new columns, and some are more performant than others.

df['new_col'] = df['some_col'] * 100 # vectorized calls

Another option is to create custom functions for row-wise operations.

def apply_func(row):
    return row['some_col'] + row['another_col']

df['new_col'] = df.apply(apply_func, axis=1) # row-wise apply

In my experience, the fastest of these is to create NumPy-vectorized solutions.

import numpy as np

def np_func(some_col, another_col):
    return some_col + another_col

vec_func = np.vectorize(np_func)

df['new_col'] = vec_func(df['some_col'].values, df['another_col'].values)
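
Worth noting: np.vectorize is essentially a Python-level loop under the hood, so for purely arithmetic functions like this one, operating on the underlying arrays directly is usually faster still:

# Plain ndarray arithmetic runs as one vectorized C operation,
# with no per-element Python call.
df['new_col'] = df['some_col'].values + df['another_col'].values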

Polars

What is the best solution for Polars?


Solution

  • Let's start with this data of various types:

    import polars as pl
    
    df = pl.DataFrame(
        {
            "col_int": [1, 2, 3, 4],
            "col_float": [10.0, 20, 30, 40],
            "col_bool": [True, False, True, False],
            "col_str": ["2020-01-01"] * 4
        }
    ).with_columns(pl.col("col_str").str.to_date().alias("col_date"))
    df
    
    shape: (4, 5)
    ┌─────────┬───────────┬──────────┬────────────┬────────────┐
    │ col_int ┆ col_float ┆ col_bool ┆ col_str    ┆ col_date   │
    │ ---     ┆ ---       ┆ ---      ┆ ---        ┆ ---        │
    │ i64     ┆ f64       ┆ bool     ┆ str        ┆ date       │
    ╞═════════╪═══════════╪══════════╪════════════╪════════════╡
    │ 1       ┆ 10.0      ┆ true     ┆ 2020-01-01 ┆ 2020-01-01 │
    │ 2       ┆ 20.0      ┆ false    ┆ 2020-01-01 ┆ 2020-01-01 │
    │ 3       ┆ 30.0      ┆ true     ┆ 2020-01-01 ┆ 2020-01-01 │
    │ 4       ┆ 40.0      ┆ false    ┆ 2020-01-01 ┆ 2020-01-01 │
    └─────────┴───────────┴──────────┴────────────┴────────────┘
    

    Polars: DataFrame.hash_rows

    I should first point out that Polars itself has a hash_rows function that will hash the rows of a DataFrame, without first needing to cast each column to a string.

    df.hash_rows()
    
    shape: (4,)
    Series: '' [u64]
    [
            16206777682454905786
            7386261536140378310
            3777361287274669406
            675264002871508281
    ]
    

    If you find this acceptable, then this would be the most performant solution. You can cast the resulting unsigned int to a string if you need to. Note: hash_rows is available only on a DataFrame, not a LazyFrame.
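
    As a minimal sketch, you can cast the u64 hashes to strings and attach them as a column (the seed keyword is optional, and note that Polars does not guarantee these hashes are stable across versions):

    df.with_columns(
        df.hash_rows(seed=0).cast(pl.Utf8).alias("row_hash")
    )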

    Using polars.concat_str and map_elements

    If you need to use your own hashing solution, then I recommend using the polars.concat_str function to concatenate the values in each row to a string. From the documentation:

    polars.concat_str(exprs: Union[Sequence[Union[polars.internals.expr.Expr, str]], polars.internals.expr.Expr], sep: str = '') → polars.internals.expr.Expr

    Horizontally Concat Utf8 Series in linear time. Non utf8 columns are cast to utf8.

    So, for example, here is the resulting concatenation on our dataset.

    df.with_columns(
        pl.concat_str(pl.all()).alias('concatenated_cols')
    )
    
    shape: (4, 6)
    ┌─────────┬───────────┬──────────┬────────────┬────────────┬────────────────────────────────┐
    │ col_int ┆ col_float ┆ col_bool ┆ col_str    ┆ col_date   ┆ concatenated_cols              │
    │ ---     ┆ ---       ┆ ---      ┆ ---        ┆ ---        ┆ ---                            │
    │ i64     ┆ f64       ┆ bool     ┆ str        ┆ date       ┆ str                            │
    ╞═════════╪═══════════╪══════════╪════════════╪════════════╪════════════════════════════════╡
    │ 1       ┆ 10.0      ┆ true     ┆ 2020-01-01 ┆ 2020-01-01 ┆ 110.0true2020-01-012020-01-01  │
    │ 2       ┆ 20.0      ┆ false    ┆ 2020-01-01 ┆ 2020-01-01 ┆ 220.0false2020-01-012020-01-01 │
    │ 3       ┆ 30.0      ┆ true     ┆ 2020-01-01 ┆ 2020-01-01 ┆ 330.0true2020-01-012020-01-01  │
    │ 4       ┆ 40.0      ┆ false    ┆ 2020-01-01 ┆ 2020-01-01 ┆ 440.0false2020-01-012020-01-01 │
    └─────────┴───────────┴──────────┴────────────┴────────────┴────────────────────────────────┘
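
    If you need the concatenation to be unambiguous (so that adjacent values cannot run together before hashing), you can pass a separator; depending on your Polars version the keyword is sep or separator. A sketch:

    df.with_columns(
        pl.concat_str(pl.all(), separator="|").alias("concatenated_cols")
    )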
    

    Taking the next step, using the map_elements method with your function yields:

    df.with_columns(
        pl.concat_str(pl.all())
        .map_elements(hash_row, return_dtype=pl.Utf8)  # declare the return dtype to skip inference
        .alias('hash')
    )
    
    shape: (4, 6)
    ┌─────────┬───────────┬──────────┬────────────┬────────────┬─────────────────────────────────────┐
    │ col_int ┆ col_float ┆ col_bool ┆ col_str    ┆ col_date   ┆ hash                                │
    │ ---     ┆ ---       ┆ ---      ┆ ---        ┆ ---        ┆ ---                                 │
    │ i64     ┆ f64       ┆ bool     ┆ str        ┆ date       ┆ str                                 │
    ╞═════════╪═══════════╪══════════╪════════════╪════════════╪═════════════════════════════════════╡
    │ 1       ┆ 10.0      ┆ true     ┆ 2020-01-01 ┆ 2020-01-01 ┆ 1826eb9c6aeb0abcdd2999a76eee576e... │
    │ 2       ┆ 20.0      ┆ false    ┆ 2020-01-01 ┆ 2020-01-01 ┆ ea50f5b11957bfc92b5ab7545b3ac12c... │
    │ 3       ┆ 30.0      ┆ true     ┆ 2020-01-01 ┆ 2020-01-01 ┆ eef039d8dedadcc282d6fa9473e071e8... │
    │ 4       ┆ 40.0      ┆ false    ┆ 2020-01-01 ┆ 2020-01-01 ┆ dcc5c57e0b5fdf15320a84c6839b0e3d... │
    └─────────┴───────────┴──────────┴────────────┴────────────┴─────────────────────────────────────┘
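
    Since the question also asks about LazyFrame: hash_rows is not available there, but this expression is, so the same pipeline can run lazily. A minimal sketch:

    (
        df.lazy()
        .with_columns(
            pl.concat_str(pl.all())
            .map_elements(hash_row, return_dtype=pl.Utf8)
            .alias("hash")
        )
        .collect()
    )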
    

    Please remember that any time Polars calls external libraries or runs Python bytecode, you are subject to the Python GIL, which means single-threaded performance - no matter how you code it. From the Polars User Guide section Do Not Kill The Parallelization!:

    We have all heard that Python is slow, and does "not scale." Besides the overhead of running "slow" bytecode, Python has to remain within the constraints of the Global Interpreter Lock (GIL). This means that if you were to use a lambda or a custom Python function to apply during a parallelized phase, Polars speed is capped running Python code preventing any multiple threads from executing the function.

    This all feels terribly limiting, especially because we often need those lambda functions in a .groupby() step, for example. This approach is still supported by Polars, but keeping in mind bytecode and the GIL costs have to be paid.

    To mitigate this, Polars implements a powerful syntax defined not only in its lazy API, but also in its eager API.
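
    To make this concrete: the row-wise addition from the Pandas example above, written as a pure Polars expression, stays in Rust and can run across multiple threads (shown here with the sample columns from this answer):

    # Pure-expression arithmetic involves no Python function calls,
    # so Polars is free to parallelize it.
    df.with_columns(
        (pl.col("col_int") + pl.col("col_float")).alias("new_col")
    )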