python, pandas, chaining

Are intermediate results piped using _ in chained operations available to subsequent functions in the chain?


I am creating a correlation matrix from which I want to obtain the maximum positive correlation value. Applying max() to the corr() result just returns 1.0, the self-correlation on the diagonal, which is not what I want, so the objective is to remove all occurrences of 1.0 and then run max(). I was hoping to do this in a chained operation, and I can use _ to pipe the intermediate result to where(), which does turn the 1.0 values into NaN. However, applying max() as the next operation in the chain still returns 1.0, as though it is ignoring the result of the where().

Is there something I'm not understanding with the _ operator? Or perhaps where() is the wrong function in this context? I have provided full code below to reproduce the question.

# Set up the problem

import pandas as pd
import numpy as np

# raw data

raw_t = [
66.6, 36.4, 47.6, 17.0, 54.6, 21.0, 12.2, 13.6, 20.6, 55.4, 63.4, 69.0,
80.2, 26.2, 42.6, 31.8, 15.6, 27.8, 13.8, 22.0, 14.2, 62.6, 96.4, 113.8,
115.2,82.2, 65.0, 23.2, 24.0, 14.2,  1.4,  3.8, 16.4, 16.4, 67.0, 51.4
]

# raw indexes

yr_mn = (np.full(12, 2000).tolist() + np.full(12, 2001).tolist() + np.full(12, 2002).tolist(),
np.arange(1,13).tolist() + np.arange(1,13).tolist() + np.arange(1,13).tolist() )

# structure multi index

index_base = list(zip(*yr_mn))
index = pd.MultiIndex.from_tuples(index_base, names=["year", "month"])

# create indexed dataset

t_dat = pd.Series(raw_t, index=index)

# example of the correlation matrix we are working with

pd.set_option("format.precision", 2)
t_dat.unstack().corr().style.background_gradient(cmap="YlGnBu")

And my attempts:


t_dat.unstack().corr().stack().where(_!=1.0) # does swap out 1.0 for NaN  
t_dat.unstack().corr().stack().where(_!=1.0).max() # still returns 1.0

Another point is that it sometimes works and sometimes doesn't, instead raising ValueError: Array conditional must be same shape as self

This also makes me suspicious that I am missing something. The default behaviour of pandas' max() is to skip NaNs, so it shouldn't have anything to do with that. I also tried setting the 1.0 to 0.0 using where(_!=1.0,0.0); same result. Also, I found the ValueError can be overcome if I comment out the where() and rerun, as:


t_dat.unstack().corr().stack()#.where(_!=1.0)

This somehow resets it, even though the original dataframe is not being altered.

Thanks for any insights! David


Solution

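  • Don't use _ for this in interactive environments. In IPython/Jupyter (and the plain Python REPL), _ holds the last displayed result of a previously run command, not the intermediate result of the current chain. The first line only works when the previous output happens to be the stacked Series, which is why rerunning with the where() commented out appears to "reset" it; with any other previous output the shapes differ and you get the ValueError. Once the first line has run, _ is the Series that already contains NaN, and NaN != 1.0 is True, so the where() in the second line keeps every value and max() still returns 1.0.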

    You can do this:

    # store the result to a variable:
    result = t_dat.unstack().corr().stack()
    
    # compute the boolean mask and set the True values to NaN
    mask = result == 1.0
    result[mask] = np.nan
    
    print(result)
    

    Prints:

    
    ...
    11     1       -0.148800
           2       -0.561202
           3       -0.595797
           4        0.945831
           5       -0.737437
           6        0.812018
           7        0.516614
           8        0.785324
           9       -0.823919
           10       0.539078
           11            NaN
           12       0.929903
    12     1       -0.502081
           2       -0.826288
           3       -0.849431
           4        0.760119
           5       -0.437322
           6        0.969761
           7        0.795323
           8        0.957978
           9       -0.557725
           10       0.811077
           11       0.929903
           12            NaN
    dtype: float64
    

    Then you can compute the max:

    print(result.max())
    

    Prints:

    0.9996502197746994
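
If you still want a single chained expression, one option is to pass a callable to where(): pandas calls it with the Series that where() is being applied to, so it plays the role that _ was expected to play. The following is only a minimal sketch, assuming a pandas version whose Series.where accepts a callable condition; MultiIndex.from_product is used here just to shorten the setup from the question, and chained_max and piped_max are illustrative names.

```python
import pandas as pd

# Same data as in the question
raw_t = [
    66.6, 36.4, 47.6, 17.0, 54.6, 21.0, 12.2, 13.6, 20.6, 55.4, 63.4, 69.0,
    80.2, 26.2, 42.6, 31.8, 15.6, 27.8, 13.8, 22.0, 14.2, 62.6, 96.4, 113.8,
    115.2, 82.2, 65.0, 23.2, 24.0, 14.2, 1.4, 3.8, 16.4, 16.4, 67.0, 51.4,
]

# Assumption: from_product yields the same (year, month) ordering
# as the zip-based index built in the question
index = pd.MultiIndex.from_product(
    [[2000, 2001, 2002], range(1, 13)], names=["year", "month"]
)
t_dat = pd.Series(raw_t, index=index)

# The lambda receives the stacked Series itself, so nothing depends on
# whatever _ happens to hold in the interactive session
chained_max = (
    t_dat.unstack()
    .corr()
    .stack()
    .where(lambda s: s != 1.0)
    .max()
)

# Equivalent idea with pipe(): drop the 1.0 entries instead of masking them
piped_max = (
    t_dat.unstack()
    .corr()
    .stack()
    .pipe(lambda s: s[s != 1.0])
    .max()
)

print(chained_max)
print(piped_max)
```

Both expressions should give the same value as the mask-based version above, because max() skips NaN by default.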