Tags: python, pandas, group-by, python-polars, pandas-apply

Get correlation per groupby/apply in Python Polars


I have a pandas DataFrame df:

import pandas as pd

d = {'era': ["a", "a", "b", "b", "c", "c"], 'feature1': [3, 4, 5, 6, 7, 8], 'feature2': [7, 8, 9, 10, 11, 12], 'target': [1, 2, 3, 4, 5, 6]}
df = pd.DataFrame(data=d)

I want to compute, for each era, the correlation between each column in feature_cols = ['feature1', 'feature2'] and TARGET_COL = 'target':

corrs_split = (
    df
    .groupby("era")
    .apply(lambda d: d[feature_cols].corrwith(d[TARGET_COL]))
)

I've been trying to do the same with Polars, but I can't get a Polars DataFrame with one row per era and one correlation column per feature. The best I've managed is a single column containing all the correlations, without the era as an index and without any way to tell the features apart.


Solution

  • Here's the Polars equivalent of that code: build one pl.corr() expression per feature and pass them all to group_by().agg().

    import polars as pl
    
    d = {'era': ["a", "a", "b","b","c", "c"], 'feature1': [3, 4, 5, 6, 7, 8], 'feature2': [7, 8, 9, 10, 11, 12], 'target': [1, 2, 3, 4, 5 ,6]}
    df = pl.DataFrame(d)
    feature_cols = ['feature1', 'feature2']
    TARGET_COL = 'target'
    
    # One correlation expression per feature; each result column
    # is named after its feature.
    agg_cols = [pl.corr(feature_col, TARGET_COL) for feature_col in feature_cols]
    print(df.group_by("era").agg(agg_cols))
    

    Output:

    shape: (3, 3)
    ┌─────┬──────────┬──────────┐
    │ era ┆ feature1 ┆ feature2 │
    │ --- ┆ ---      ┆ ---      │
    │ str ┆ f64      ┆ f64      │
    ╞═════╪══════════╪══════════╡
    │ c   ┆ 1.0      ┆ 1.0      │
    │ b   ┆ 1.0      ┆ 1.0      │
    │ a   ┆ 1.0      ┆ 1.0      │
    └─────┴──────────┴──────────┘
    

    (The row order may be different for you, because group_by() does not guarantee the order of the groups by default.)