Amendment:
If I have a pandas DataFrame that includes 5 columns Col1
& Col2
& Col3
& Col4
& Col5
and I need to get max Pearson's correlation coefficient between(Col2
,Col3
) & (Col2
,Col4
) & (Col2
,Col5
) by considering the values in Col1
The modified values for Col2
which obtained by the next formula:
df['Col1']=np.power((df['Col1']),B)
df['Col2']=df['Col2']*df['Col1']
where B
is the changing variable (a single value) to get max Pearson's correlation coefficient between (new values of Col2
,Col3
) & (new values of Col2
,Col4
) & (new values of Col2
,Col5
).
Update:
The above table containing 5 columns as I mentioned above, the correlation between coefficient between (Col2
,Col3
) & (Col2
,Col4
) & (Col2
,Col5
) is illustrated below the table.
I need to change the values of Col2
based on two the mentioned equations where the changing value is B
.
So the question is how to get the best value of B
that gives a new correlation coefficient greater than or equal its counterpart(old)?
Update 2 :
Col1,Col2,Col3,Col4,Col5
2,0.051361397,2618,1453,1099
4,0.053507779,306,153,150
2,0.041236151,39,54,34
6,0.094526419,2755,2209,1947
4,0.079773397,2313,1261,1022
4,0.083891415,3528,2502,2029
6,0.090737243,3594,2781,2508
2,0.069552772,370,234,246
2,0.052401789,690,402,280
2,0.039930675,1218,846,631
4,0.065952096,1706,523,453
2,0.053064126,314,197,123
6,0.076847486,4019,1675,1452
2,0.044881545,604,402,356
2,0.073102611,2214,1263,1050
0,0.046998526,938,648,572
Not extremely elegant, but works; feel free to make this more generic:
import pandas as pd
from scipy.optimize import minimize
def minimize_me(b, df):
# we want to maximize, so we have to multiply by -1
return -1 * df['Col3'].corr(df['Col2'] * df['Col1'] ** b )
# read your dataframe from somehwere, e.g. csv
df = pd.read_clipboard(sep=',')
# B is greater than 0 for now
bnds = [(0, None)]
res = minimize(minimize_me, (1), args=(df,), bounds=bnds)
if res.success:
# that's the optimal B
print(res.x[0])
# that's the highest correlation you can get
print(-1 * res.fun)
else:
print("Sorry, the optimization was not successful. Try with another initial"
" guess or optimization method")
This will print:
0.9020784246026575 # your B
0.7614993786787415 # highest correlation for corr(col2, col3)
I now read from clipboard
, replace that by your .csv
file. You should then also avoid the hardcoding of the columns; the code above is just for demonstration purposes, so that you see how to set up the optimization problem itself.
If you are interested in the sum, you can use (rest of code unmodified):
def minimize_me(b, df):
col_mod = df['Col2'] * df['Col1'] ** b
# we want to maximize, so we have to multiply by -1
return -1 * (df['Col3'].corr(col_mod) +
df['Col4'].corr(col_mod) +
df['Col5'].corr(col_mod))
This will print:
1.0452394748131613
2.3428368479642137