Search code examples
pandaspandas-groupbycorrelationsklearn-pandas

Pandas Correlation Error Using Sklearn Metrics


I am trying to calculate r2 or r-squared over a large dataset with pandas and grouping the data by plant_name and month in a dataframe like "data1" shown below. The problem is that when I use sklearn metrics and a defined function, I obtain a result that is not consistent with a result that I obtain using the same data in "data1" in Excel. Here is the data in "data1"

    plant_name  month  year  wind_speed_obs  wind_speed_ms
0   BIG HORN I      1  2018        5.143830       6.012436
1   BIG HORN I      1  2019        4.556545       5.231855
2   BIG HORN I      1  2020        6.582890       7.866532
3   BIG HORN I      2  2018        7.904438       9.248810
4   BIG HORN I      2  2019        4.353567       5.115625
5   BIG HORN I      2  2020        7.376739       8.408046
6   BIG HORN I      3  2018        6.138197       6.922043
7   BIG HORN I      3  2019        3.881804       4.484274
8   BIG HORN I      3  2020        7.071029       7.347177
9   BIG HORN I      4  2018        7.106936       7.699861
10  BIG HORN I      4  2019        6.874942       7.575278
11  BIG HORN I      4  2020        6.855979       7.106250
12  BIG HORN I      5  2018        5.366054       6.510753
13  BIG HORN I      5  2019        5.657342       6.597581
14  BIG HORN I      5  2020        7.010745       7.247043
15  BIG HORN I      6  2018        6.399417       7.076528
16  BIG HORN I      6  2019        6.578241       7.556111
17  BIG HORN I      6  2020        7.120105       7.548194
18  BIG HORN I      7  2018        5.615110       6.123925
19  BIG HORN I      7  2019        6.212104       6.963441
20  BIG HORN I      7  2020        6.663250       6.972312
21  BIG HORN I      8  2018        5.303967       5.947312
22  BIG HORN I      8  2019        5.176691       6.209274
23  BIG HORN I      8  2020        6.093748       6.337634
24  BIG HORN I      9  2018        5.375531       5.878472
25  BIG HORN I      9  2019        6.126961       6.792500
26  BIG HORN I      9  2020        5.608530       6.028056
27  BIG HORN I     10  2018        4.466079       5.054973
28  BIG HORN I     10  2019        5.492795       6.326075
29  BIG HORN I     10  2020        7.103278       7.492070
30  BIG HORN I     11  2018        5.341987       5.889028
31  BIG HORN I     11  2019        4.887397       5.144028
32  BIG HORN I     11  2020        6.718649       7.150000
33  BIG HORN I     12  2018        5.099386       5.866935
34  BIG HORN I     12  2019        3.925717       4.234140
35  BIG HORN I     12  2020        5.589325       5.943145

Here is the code i'm using:

from sklearn.metrics import r2_score
def r2_rmse2( g ):
    r2 = r2_score( g['wind_speed_obs'], g['wind_speed_ms'] )
    #rmse = np.sqrt( mean_squared_error( g['wind_speed_obs'], g['wind_speed_ms'] ) )
    return pd.Series( dict(  r2 = r2 ) )
data1.groupby( ['plant_name','month'] ).apply( r2_rmse2 ).reset_index()

I obtain this result when applying the r2_rmse2 function above:

    plant_name  month         r2
0   BIG HORN I      1  -0.314771
1   BIG HORN I      2   0.529890
2   BIG HORN I      3   0.804066
3   BIG HORN I      4 -22.164720
4   BIG HORN I      5  -0.460690
5   BIG HORN I      6  -4.673359
6   BIG HORN I      7  -0.662166
7   BIG HORN I      8  -2.118815
8   BIG HORN I      9  -1.946566
9   BIG HORN I     10   0.662636
10  BIG HORN I     11   0.696896
11  BIG HORN I     12   0.446235

the correct result when i test the function in Excel that I should obtain applying the function is:

plant_name  month   r2
BIG HORN I  1   0.999975202
BIG HORN I  2   0.998459857
BIG HORN I  3   0.988712352
BIG HORN I  4   0.711649414
BIG HORN I  5   0.998282523
BIG HORN I  6   0.681460011
BIG HORN I  7   0.907152074
BIG HORN I  8   0.66212225
BIG HORN I  9   0.98807953
BIG HORN I  10  0.988469127
BIG HORN I  11  0.990836283
BIG HORN I  12  0.968629237

I cannot understand why the apply using the fuction is incorrect. Thank you for your help.


Solution

  • Here is the computation of R squared, RMSE and Pearson correlation coefficient (as used in Excel) on your data:

    from sklearn.metrics import r2_score, mean_squared_error
    from scipy.stats import pearsonr
    def r2_rmse2(g):
        r2 = r2_score(g['wind_speed_obs'], g['wind_speed_ms'])
        rmse = mean_squared_error(g['wind_speed_obs'], g['wind_speed_ms'], squared=False)
        correl = pearsonr(g['wind_speed_obs'], g['wind_speed_ms'])[0]
        return pd.Series( dict(  r2 = r2, rmse=rmse, correl=correl ) )
    data1.groupby( ['plant_name','month'] ).apply( r2_rmse2 ).reset_index()
    
        plant_name  month         r2      rmse    correl
    0   BIG HORN I      1  -0.314771  0.976090  0.999975
    1   BIG HORN I      2   0.529890  1.072639  0.998460
    2   BIG HORN I      3   0.804066  0.592633  0.988712
    3   BIG HORN I      4 -22.164844  0.549141  0.711649
    4   BIG HORN I      5  -0.460691  0.866068  0.998283
    5   BIG HORN I      6  -4.673359  0.729833  0.681460
    6   BIG HORN I      7  -0.662167  0.553450  0.907152
    7   BIG HORN I      8  -2.118817  0.716380  0.662122
    8   BIG HORN I      9  -1.946562  0.539102  0.988080
    9   BIG HORN I     10   0.662637  0.630426  0.988469
    10  BIG HORN I     11   0.696896  0.428632  0.990836
    11  BIG HORN I     12   0.446234  0.519437  0.968629