I am creating a correlation matrix from which I want to obtain the maximum positive correlation value. Applying max() to the corr() results just returns 1.0, the self-correlations along the diagonal, which is not what I want, so the objective is to remove all occurrences of 1.0 and then run max(). I was hoping to do this in a chained operation, and tried using _ to pipe intermediate results to the where() operation, which does turn 1.0 into NaN. However, applying max() as the next operation in the chain still returns 1.0, as though it is ignoring the result of where().
Is there something I'm not understanding about the _ operator? Or perhaps where() is the wrong function in this context? I have provided full code below to reproduce the question.
# Set up the problem
import pandas as pd
import numpy as np
# raw data
raw_t = [
66.6, 36.4, 47.6, 17.0, 54.6, 21.0, 12.2, 13.6, 20.6, 55.4, 63.4, 69.0,
80.2, 26.2, 42.6, 31.8, 15.6, 27.8, 13.8, 22.0, 14.2, 62.6, 96.4, 113.8,
115.2,82.2, 65.0, 23.2, 24.0, 14.2, 1.4, 3.8, 16.4, 16.4, 67.0, 51.4
]
# raw indexes
yr_mn = (np.full(12, 2000).tolist() + np.full(12, 2001).tolist() + np.full(12, 2002).tolist(),
np.arange(1,13).tolist() + np.arange(1,13).tolist() + np.arange(1,13).tolist() )
# structure multi index
index_base = list(zip(*yr_mn))
index = pd.MultiIndex.from_tuples(index_base, names=["year", "month"])
# create indexed dataset
t_dat = pd.Series(raw_t, index=index)
# example of the correlation matrix we are working with
pd.set_option("format.precision", 2)
t_dat.unstack().corr().style.background_gradient(cmap="YlGnBu")
And my attempts:
t_dat.unstack().corr().stack().where(_!=1.0) # does swap out 1.0 for NaN
t_dat.unstack().corr().stack().where(_!=1.0).max() # still returns 1.0
Another point is that it sometimes works and sometimes doesn't, instead raising
ValueError: Array conditional must be same shape as self
This also makes me suspicious that I am missing something. By default, pandas' max() skips NaNs, so that shouldn't be the cause. I also tried setting the 1.0 values to 0.0 using where(_!=1.0,0.0); same result. Also, I found the ValueError can be overcome if I comment out the where() and rerun, as:
t_dat.unstack().corr().stack()#.where(_!=1.0)
This somehow resets it, even though the original dataframe is not being altered.
Thanks for any insights! David
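Don't use _ as a pipe in interactive environments. It is not an operator: it holds the result of the last command, so _!=1.0 is evaluated against whatever was displayed most recently, not against the intermediate result of the chain (it could work, but eventually it will break).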
You can do this:
# store the result to a variable:
result = t_dat.unstack().corr().stack()
# compute the boolean mask and set the True values to NaN
mask = result == 1.0
result[mask] = np.nan
print(result)
Prints:
...
11 1 -0.148800
2 -0.561202
3 -0.595797
4 0.945831
5 -0.737437
6 0.812018
7 0.516614
8 0.785324
9 -0.823919
10 0.539078
11 NaN
12 0.929903
12 1 -0.502081
2 -0.826288
3 -0.849431
4 0.760119
5 -0.437322
6 0.969761
7 0.795323
8 0.957978
9 -0.557725
10 0.811077
11 0.929903
12 NaN
dtype: float64
Then you can compute the max:
print(result.max())
Prints:
0.9996502197746994
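If you would still like a single chained expression, one option is to pass a callable to where(), which pandas calls with the object being filtered. The sketch below is only an illustration: df is a small stand-in for t_dat.unstack() built from the first four months of the question's data, and the names max_chain, off_diag and max_offdiag are my own. The second variant masks the diagonal by position instead of by value, in case the self-correlations do not come out as exactly 1.0 in floating point:

```python
import numpy as np
import pandas as pd

# Stand-in for t_dat.unstack(): years as rows, months 1-4 as columns.
df = pd.DataFrame(
    {
        1: [66.6, 80.2, 115.2],
        2: [36.4, 26.2, 82.2],
        3: [47.6, 42.6, 65.0],
        4: [17.0, 31.8, 23.2],
    },
    index=[2000, 2001, 2002],
)
corr = df.corr()

# Variant 1: stay in a chain; the lambda receives the stacked Series itself,
# so there is no dependence on `_`.
max_chain = corr.stack().where(lambda s: s != 1.0).max()

# Variant 2: set the diagonal to NaN by position, then take the overall max
# (max() skips NaN by default).
off_diag = corr.mask(np.eye(len(corr), dtype=bool))
max_offdiag = off_diag.max().max()

print(max_chain, max_offdiag)
```

With the real data, replacing df with t_dat.unstack() should be enough; the position-based variant is the safer of the two if you are unsure the diagonal is exactly 1.0.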