python python-3.x machine-learning statistics correlation

How can I drop low-correlated features?


I am writing preprocessing code for my LSTM training. My csv contains more than 30 variables. After applying some EDA techniques, I found that half of the features can be dropped because they have no effect on training.

Right now I am dropping such features manually by using pandas.

I want to write code that drops such features automatically. I wrote code to visualize a heat map and the correlations this way:

# I am making a class, so this part is from the preprocessing step.
# self.data is a DataFrame which contains all the csv data.
# Imports used: matplotlib.pyplot as plt, seaborn as sns, scipy.stats as stats

def calculateCorrelationByPearson(self):
    columns = self.data.columns
    plt.figure(figsize=(12, 8))
    sns.heatmap(data=self.data.corr(method='pearson'), annot=True, fmt='.2f',
                linewidths=0.5, cmap='Blues')
    plt.show()
    for column in columns:
        # Correlate each single column (not the whole frame) with the target 'total'
        corr = stats.spearmanr(self.data['total'], self.data[column])
        print(f'{column} - corr coefficient:{corr[0]}, p-value:{corr[1]}')

This gives me a clear view of my features and their relationships with each other.

Now I want to drop the columns which are not important, say those with a correlation below 0.4.

How can I apply this logic in my code?


Solution

  • Here is an approach to remove variables whose correlation coefficient with the target is below some threshold:

    import pandas as pd
    from scipy.stats import spearmanr
    
    data = pd.DataFrame([{"A":1, "B":2, "C":3},{"A":2, "B":3, "C":1},{"A":3, "B":4, "C":0},{"A":4, "B":4, "C":1},{"A":5, "B":6, "C":2}])
    targetVar = "A"
    corr_threshold = 0.4
    
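    corr = spearmanr(data)  # corr[0] is the matrix of pairwise Spearman coefficients
    target_idx = data.columns.get_loc(targetVar)  # position of the target column (0 here)
    corrSeries = pd.Series(corr[0][:, target_idx], index=data.columns)  # column names and their correlation with the target
    corrSeries = corrSeries[(corrSeries.index != targetVar) & (corrSeries > corr_threshold)]  # apply the threshold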
    
    vars_to_keep = list(corrSeries.index.values) #list of variables to keep
    vars_to_keep.append(targetVar)  #add the target variable back in
    data2 = data[vars_to_keep]
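
The snippet above compares the signed coefficient with the threshold, so a feature with a strong negative correlation (for example -0.9) would be dropped too. If that is not what you want, here is a minimal sketch of a variant that filters on the absolute value, using only pandas' `DataFrame.corr`. The function name `drop_low_correlated` and the extra column `D` are made up for illustration, and whether absolute correlation is the right criterion for your data is an assumption.

```python
import pandas as pd


def drop_low_correlated(data, target, threshold=0.4, method="spearman"):
    """Return `data` without the numeric columns weakly correlated with `target`.

    Hypothetical helper for illustration; not part of pandas or scipy.
    """
    # Only numeric columns can be correlated; other columns are left untouched
    numeric = data.select_dtypes("number")
    # Correlation of every numeric feature with the target column
    corr = numeric.corr(method=method)[target].drop(target)
    # Absolute value keeps strong negative correlations as well;
    # NaN (e.g. from a constant column) fails the comparison and counts as weak
    weak = corr[~(corr.abs() >= threshold)].index
    return data.drop(columns=weak)


data = pd.DataFrame({
    "A": [1, 2, 3, 4, 5],
    "B": [2, 3, 4, 4, 6],
    "C": [3, 1, 0, 1, 2],
    "D": [9, 7, 6, 4, 1],  # assumed extra column: strongly negative vs. "A"
})
reduced = drop_low_correlated(data, target="A")
print(list(reduced.columns))  # ['A', 'B', 'D']
```

In a class like the one in the question, the same call would be something like `self.data = drop_low_correlated(self.data, target='total')`.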