Search code examples
pythonpandaspearson-correlation

Filtering row values in pandas by threshold value


I have a pandas correlation matrix dataframe that has hundreds of columns and rows. I want to filter the whole dataframe so that i only get cells that are above a certain value, any row value > .4, for example. I'm not sure what is the best way to do that. I've seen multiple posts that filter by column name, but with hundreds of columns, how would you iterate through each column?


Solution

  • If want filter only rows use boolean indexing with mask created by gt > with any for check at least one True:

    df[df.gt(.4).any(axis=1)]
    

    Or if want filter rows and columns:

    m = df.gt(.4)
    df.loc[m.any(axis=1), m.any()]
    

    If want only values by condition is possible use where for replace another values to NaNs:

    df.where(df.gt(.4))
    

    Sample:

    np.random.seed(4137)
    df = pd.DataFrame(np.random.rand(5,3), columns=list('ABC'))
    df['A'] /=  10
    print (df)
              A         B         C
    0  0.090262  0.313517  0.319292
    1  0.086841  0.346788  0.314144
    2  0.044069  0.764295  0.886659
    3  0.089739  0.230103  0.594301
    4  0.047101  0.132819  0.692846
    
    df1 = df[df.gt(.4).any(axis=1)]
    print (df1)
              A         B         C
    2  0.044069  0.764295  0.886659
    3  0.089739  0.230103  0.594301
    4  0.047101  0.132819  0.692846
    
    m = df.gt(.4)
    df2 = df.loc[m.any(axis=1), m.any()]
    print (df2)
              B         C
    2  0.764295  0.886659
    3  0.230103  0.594301
    4  0.132819  0.692846
    
    df3 = df.where(df.gt(.4))
    print (df3)
        A         B         C
    0 NaN       NaN       NaN
    1 NaN       NaN       NaN
    2 NaN  0.764295  0.886659
    3 NaN       NaN  0.594301
    4 NaN       NaN  0.692846
    

    Detail:

    print (m)
           A      B      C
    0  False  False  False
    1  False  False  False
    2  False   True   True
    3  False  False   True
    4  False  False   True
    
    print (m.any(axis=1))
    0    False
    1    False
    2     True
    3     True
    4     True
    dtype: bool
    
    print (m.any())
    A    False
    B     True
    C     True
    dtype: bool