Search code examples
pythonpandasdataframeoptimizationpython-polars

Can I optimize this cpu-bound pandas code with polars?


I have this pandas code:

def last_non_null(s):
    return s.dropna().iloc[-1] if not s.dropna().empty else np.nan

def merge_rows_of_final_df(df_final):
    # Group by columns A, B, and C
    cols = ['A', 'B', 'C']

    # Apply function to each column not in the subset
    # NOTE: the other columns have np.float32 as dtype
    agg_dict = {col: last_non_null for col in df_final.columns.difference(cols)}

    # Group and aggregate
    return df_final.groupby(cols).agg(agg_dict).reset_index()

df_merged = merge_rows_of_final_df(df_final)

Can I optimize this with polars? If so, can someone provide me the polars snippet with the same output? The idea is that final_df is the result of concat a list of dataframes that have duplicates on the subset columns A,B,C. I wish to merge the rows of those groups by taking the last/first not-null value of each column.

Example input:

df = pd.DataFrame({
    "A":[1,1,1,1,4,4,4],
    "B":[2,2,2,2,5,5,5],
    "C":[3,3,3,3,6,6,6],
    "D":[4, np.nan, np.nan, np.nan, 1, np.nan, np.nan],
    "E":[np.nan, np.nan, np.nan, 7, np.nan, np.nan, 2],
    "F":[np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan],
})

Example output:

    A   B   C   D   E   F
0   1   2   3   4.0 7.0 NaN
1   4   5   6   1.0 2.0 NaN

Solution

  • You can drop_nulls() and then take the last()

    df = pl.DataFrame({
        "A":[1,1,1,1,4,4,4],
        "B":[2,2,2,2,5,5,5],
        "C":[3,3,3,3,6,6,6],
        "D":[4, None, None, None, 1, None, None],
        "E":[None, None, None, 7, None, None, 2],
        "F":[None, None, None, None, None, None, None],
    })
    
    df.group_by("A", "B", "C").agg(
        pl.all().drop_nulls().last()
    )
    
    shape: (2, 6)
    ┌─────┬─────┬─────┬─────┬─────┬──────┐
    │ A   ┆ B   ┆ C   ┆ D   ┆ E   ┆ F    │
    │ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ ---  │
    │ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 ┆ null │
    ╞═════╪═════╪═════╪═════╪═════╪══════╡
    │ 1   ┆ 2   ┆ 3   ┆ 4   ┆ 7   ┆ null │
    │ 4   ┆ 5   ┆ 6   ┆ 1   ┆ 2   ┆ null │
    └─────┴─────┴─────┴─────┴─────┴──────┘