Search code examples
pythonscipycorrelationminimizationscipy-optimize

Optimize changing variables to get max Pearson's correlation coefficient for multiple columns


Amendment:

If I have a pandas DataFrame that includes 5 columns Col1 & Col2 & Col3 & Col4 & Col5 and I need to get max Pearson's correlation coefficient between(Col2,Col3) & (Col2,Col4) & (Col2,Col5) by considering the values in Col1

The modified values for Col2 which obtained by the next formula:

df['Col1']=np.power((df['Col1']),B)
df['Col2']=df['Col2']*df['Col1']

where B is the changing variable (a single value) to get max Pearson's correlation coefficient between (new values of Col2,Col3) & (new values of Col2,Col4) & (new values of Col2,Col5).

Update:

enter image description here

The above table containing 5 columns as I mentioned above, the correlation between coefficient between (Col2,Col3) & (Col2,Col4) & (Col2,Col5) is illustrated below the table.

I need to change the values of Col2 based on two the mentioned equations where the changing value is B.

So the question is how to get the best value of B that gives a new correlation coefficient greater than or equal its counterpart(old)?

enter image description here

Update 2 :

Col1,Col2,Col3,Col4,Col5

2,0.051361397,2618,1453,1099

4,0.053507779,306,153,150

2,0.041236151,39,54,34

6,0.094526419,2755,2209,1947

4,0.079773397,2313,1261,1022

4,0.083891415,3528,2502,2029

6,0.090737243,3594,2781,2508

2,0.069552772,370,234,246

2,0.052401789,690,402,280

2,0.039930675,1218,846,631

4,0.065952096,1706,523,453

2,0.053064126,314,197,123

6,0.076847486,4019,1675,1452

2,0.044881545,604,402,356

2,0.073102611,2214,1263,1050

0,0.046998526,938,648,572


Solution

  • Not extremely elegant, but works; feel free to make this more generic:

    import pandas as pd
    from scipy.optimize import minimize
    
    
    def minimize_me(b, df):
    
        # we want to maximize, so we have to multiply by -1
        return -1 * df['Col3'].corr(df['Col2'] * df['Col1'] ** b )
    
    # read your dataframe from somehwere, e.g. csv
    df = pd.read_clipboard(sep=',')
    
    # B is greater than 0 for now
    bnds = [(0, None)]
    
    res = minimize(minimize_me, (1), args=(df,), bounds=bnds)
    
    if res.success:
        # that's the optimal B
        print(res.x[0])
    
        # that's the highest correlation you can get
        print(-1 * res.fun)
    else:
        print("Sorry, the optimization was not successful. Try with another initial"
              " guess or optimization method")
    

    This will print:

    0.9020784246026575 # your B
    0.7614993786787415 # highest correlation for corr(col2, col3)
    

    I now read from clipboard, replace that by your .csv file. You should then also avoid the hardcoding of the columns; the code above is just for demonstration purposes, so that you see how to set up the optimization problem itself.

    If you are interested in the sum, you can use (rest of code unmodified):

    def minimize_me(b, df):
    
        col_mod = df['Col2'] * df['Col1'] ** b
    
        # we want to maximize, so we have to multiply by -1
        return -1 * (df['Col3'].corr(col_mod) +
                     df['Col4'].corr(col_mod) +
                     df['Col5'].corr(col_mod))
    

    This will print:

    1.0452394748131613
    2.3428368479642137