How to translate pandas DataFrame operations to Polars in Python?


I am trying to convert some pandas DataFrame operations to Polars in Python, but I am running into difficulties, particularly with row-wise operations and element-wise comparisons. Here is the pandas code I am working with:

import pandas as pd

df_a = pd.DataFrame({
    "feature1": [1, 2, 3],
    "feature2": [7, 8, 9],
})

df_b = pd.DataFrame({
    "feature1": [3, 8, 2],
    "feature2": [7, 4, 9],
})

if selection_mode == 'option1':
    max_values = df_a.max(axis=1)
    selected_features = df_a.eq(max_values, axis=0)
    final_result = selected_features.mul(df_b).sum(axis=1) / selected_features.sum(axis=1)

elif selection_mode == 'option2':
    above_avg = df_a.ge(df_a.mean(axis=1), axis=0)
    combined_df = above_avg.mul(df_a).mul(df_b)
    sum_combined = combined_df.sum(axis=1)
    sum_above_avg = above_avg.mul(df_a).sum(axis=1)
    final_result = sum_combined / sum_above_avg

Any guidance on translating this pandas code to Polars would be greatly appreciated!


Solution

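  • Polars has dedicated horizontal functions for "row-wise" operations, such as max_horizontal, mean_horizontal and sum_horizontal, which take the place of pandas' axis=1 reductions. The snippets below assume df_a and df_b have been built with pl.DataFrame from the same data as in the question.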

    df_a.max_horizontal()
    
    shape: (3,)
    Series: 'max' [i64]
    [
        7
        8
        9
    ]
    

    When a DataFrame is compared with a Series, Polars "broadcasts" the Series across all columns, so every column is compared element-wise against it.

    df_a == df_a.max_horizontal()  # same as df_a.select(pl.all() == pl.Series([7, 8, 9]))
    
    shape: (3, 2)
    ┌──────────┬──────────┐
    │ feature1 ┆ feature2 │
    │ ---      ┆ ---      │
    │ bool     ┆ bool     │
    ╞══════════╪══════════╡
    │ false    ┆ true     │
    │ false    ┆ true     │
    │ false    ┆ true     │
    └──────────┴──────────┘
    

    Option #1

    max_values = df_a.max_horizontal()
    selected_features = df_a == max_values
    
    final_result = (
        (selected_features * df_b).sum_horizontal() / selected_features.sum_horizontal()
    )
    

    Option #2

    above_avg = df_a >= df_a.mean_horizontal()
    combined_df = above_avg * df_a * df_b
    sum_combined = combined_df.sum_horizontal()
    sum_above_avg = (above_avg * df_a).sum_horizontal()
    
    final_result = sum_combined / sum_above_avg