Tags: python, pandas, correlation, tabular

How to select columns that are highly correlated with one specific column in a dataframe


I have a dataframe with over 100 columns that I am using to build a model. One column (A) is the response and all the other columns (B, C, D, etc.) are predictors. I want to select all the columns that are correlated with column A, based on the correlation coefficient (say > 0.2). I have already generated a heatmap of the correlation coefficients between each pair of columns. Is there a quick way in pandas to get all the columns whose correlation coefficient with column A is over 0.2 (a threshold I will adjust if needed)? Thanks in advance!


Solution

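  • Compute the correlation matrix with df.corr(), take its 'A' column, and compare it to your cut-off. The result is a Boolean mask over the columns, which you can pass to .loc to keep only the columns that meet the condition.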

    import pandas as pd
    df = pd.DataFrame({'A': [1,2,3,4,5,6,7,8,9,10],
                       'B': [1,2,4,3,5,7,6,8,10,11], 
                       'C': [15,-1,17,-10,-10,-13,-99,-101,0,0],
                       'D': [0,10,0,0,-10,0,0,-10,0,10]} )
    
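    # df.corr()['A'] holds each column's Pearson correlation with A;
    # comparing it to the cut-off gives a Boolean mask over the columns.
    df.loc[:, df.corr()['A'] > 0.2]
    
    # Output (A is kept too, since its correlation with itself is 1):
        A   B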
    0   1   1
    1   2   2
    2   3   4
    3   4   3
    4   5   5
    5   6   7
    6   7   6
    7   8   8
    8   9   10
    9   10  11
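
Note that the mask above keeps only positively correlated columns and always includes A itself. If strongly negative correlations also matter for the model, a variation along the following lines may help. It is a minimal sketch, not part of the original answer: it uses DataFrame.corrwith to correlate each predictor with A directly instead of building the full pairwise matrix, and compares the absolute value against the cut-off. The names threshold, corr_with_a and selected are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                   'B': [1, 2, 4, 3, 5, 7, 6, 8, 10, 11],
                   'C': [15, -1, 17, -10, -10, -13, -99, -101, 0, 0],
                   'D': [0, 10, 0, 0, -10, 0, 0, -10, 0, 10]})

threshold = 0.2  # illustrative cut-off, adjust as needed

# Correlate every predictor with A only (A itself is dropped first),
# rather than computing all pairwise correlations.
corr_with_a = df.drop(columns='A').corrwith(df['A'])

# Keep predictors whose correlation is strong in either direction.
selected = corr_with_a[corr_with_a.abs() > threshold].index.tolist()

print(selected)  # ['B', 'C']
```

Here C is picked up because its correlation with A is about -0.42, which the original > 0.2 test excludes. Use df[selected] to get the corresponding sub-dataframe of predictors.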