python pandas machine-learning decision-tree

How to find attributes with a correlation greater than 0.50


After analyzing the dataset, how can we find the correlation of all attributes?

correlations = data.corr(method='pearson')

print(correlations >= 0.50)

I'm not getting the proper output.


Solution
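
The expression in the question, `correlations >= 0.50`, is an element-wise comparison, so printing it shows a table of True/False values rather than the correlations themselves. Indexing the matrix with that mask returns the values instead. A minimal sketch of the difference, assuming a small made-up DataFrame (the column names `x`, `y`, `z` and their values are illustrative only, not from the question's dataset):

```python
import pandas as pd

# Hypothetical toy data: y is exactly 2*x, z is only weakly related to x.
data = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0],
    "y": [2.0, 4.0, 6.0, 8.0, 10.0],
    "z": [3.0, -1.0, 4.0, -1.0, 5.0],
})
correlations = data.corr(method="pearson")

# Element-wise comparison: the result is a boolean DataFrame.
mask = correlations.abs() >= 0.50
# Boolean indexing keeps values where the mask is True and puts NaN elsewhere.
strong = correlations[mask]

print(mask)
print(strong)
```

Using `.abs()` also catches strong negative correlations, which a plain `>= 0.50` comparison would miss.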

  • Data:

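    import numpy as np
    import pandas as pd
    np.random.seed(4)  # fixed seed so the numbers below are reproducible
    # 1000 rows of independent standard-normal noise in five columns
    df = pd.DataFrame(np.random.standard_normal((1000, 5)))
    df.columns = list("ABCDE")
    df_cor = df.corr(method='pearson')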
    

    df.head():

              A         B         C         D         E
    0  0.050562  0.499951 -0.995909  0.693599 -0.418302
    1 -1.584577 -0.647707  0.598575  0.332250 -1.147477
    2  0.618670 -0.087987  0.425072  0.332253 -1.156816
    3  0.350997 -0.606887  1.546979  0.723342  0.046136
    4 -0.982992  0.054433  0.159893 -1.208948  2.223360
    

    df_cor:

              A         B         C         D         E
    A  1.000000 -0.008658 -0.015977 -0.001219 -0.008043
    B -0.008658  1.000000  0.037419 -0.055335  0.057751
    C -0.015977  0.037419  1.000000  0.000049  0.057091
    D -0.001219 -0.055335  0.000049  1.000000 -0.017879
    E -0.008043  0.057751  0.057091 -0.017879  1.000000
    
    # Checking for correlations with an absolute value > 0.05. The threshold here is `0.05` because the random data has no strong correlations; change it to `0.5` for your data.
    
    (df_cor[df_cor.abs() > .05]
        .dropna(axis=1, how='all')
        .replace(1., np.nan)
        .dropna(how='all', axis=1)
        .dropna(how='all', axis=0)
        .apply(lambda x: x.dropna().to_dict(), axis=1)
        .to_dict())
    
    {'B': {'D': -0.0553348494117175, 'E': 0.057751329924049855},
     'C': {'E': 0.057091148280687266},
     'D': {'B': -0.0553348494117175},
     'E': {'B': 0.057751329924049855, 'C': 0.057091148280687266}}
    

    If you need the output as a DataFrame:

    df_cor[df_cor.abs() > .05].replace(1, np.nan)
    
        A         B         C         D         E
    A NaN       NaN       NaN       NaN       NaN
    B NaN       NaN       NaN -0.055335  0.057751
    C NaN       NaN       NaN       NaN  0.057091
    D NaN -0.055335       NaN       NaN       NaN
    E NaN  0.057751  0.057091       NaN       NaN
    

    After dropping the columns that contain no values:

    df_cor[df_cor.abs() > .05].replace(1, np.nan).dropna(how='all', axis=1)
    
              B         C         D         E
    A       NaN       NaN       NaN       NaN
    B       NaN       NaN -0.055335  0.057751
    C       NaN       NaN       NaN  0.057091
    D -0.055335       NaN       NaN       NaN
    E  0.057751  0.057091       NaN       NaN
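
Because a correlation matrix is symmetric, each pair shows up twice in the outputs above. One possible way to list every pair only once is to keep just the upper triangle before filtering. The sketch below is an illustration under stated assumptions, not part of the original answer: the helper name `strong_pairs`, the synthetic columns, and the `0.5` threshold are all made up for the example.

```python
import numpy as np
import pandas as pd

def strong_pairs(df, threshold=0.5):
    """Return a Series of correlations with |r| > threshold, one entry per pair."""
    cor = df.corr(method="pearson")
    # True strictly above the diagonal, so self-correlations and
    # mirrored duplicates are masked out (set to NaN by `where`).
    upper = np.triu(np.ones(cor.shape, dtype=bool), k=1)
    # Long format indexed by (row label, column label).
    pairs = cor.where(upper).stack()
    # NaN never satisfies the comparison, so masked cells are excluded here.
    return pairs[pairs.abs() > threshold]

# Hypothetical data: B is built from A plus noise, so A and B correlate
# strongly; C is independent of both.
rng = np.random.default_rng(0)
a = rng.standard_normal(1000)
demo = pd.DataFrame({
    "A": a,
    "B": a + 0.5 * rng.standard_normal(1000),
    "C": rng.standard_normal(1000),
})

result = strong_pairs(demo, threshold=0.5)
print(result)
```

The result is indexed by attribute pair, so `result.index.tolist()` gives the pairs and `result.to_dict()` gives a dictionary keyed by pair.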