Tags: python, heatmap, correlation, feature-selection

Correlation coefficient explanation--Feature Selection


How do we determine which variables to remove from our model based on the correlation coefficient?

See the example of variables below:

Top 10 Absolute Correlations:
  Variable 1      Variable 2        Correlation Value
    pdays           pmonths           1.000000
    emp.var.rate    euribor3m         0.970955
    euribor3m       nr.employed       0.942545
    emp.var.rate    nr.employed       0.899818
    previous        pastEmail         0.798017
    emp.var.rate    cons.price.idx    0.763827
    cons.price.idx  euribor3m         0.670844
    contact         cons.price.idx    0.585899
    previous        nr.employed       0.504471
    cons.price.idx  nr.employed       0.490632
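For reference, a ranking like the table above can be produced directly from a pandas correlation matrix. The sketch below uses synthetic data; only the `pdays`/`pmonths` pair mirrors the question, and the other column names are illustrative:

```python
# Sketch: rank pairs of features by absolute correlation (synthetic data).
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
df = pd.DataFrame(rng.normal(size=(500, 4)),
                  columns=["pdays", "euribor3m", "nr.employed", "previous"])
df["pmonths"] = df["pdays"] / 30.0          # duplicate information -> corr = 1.0

corr = df.corr().abs()
# Keep only the upper triangle so each pair appears once, then rank.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
top_pairs = upper.stack().sort_values(ascending=False)
print(top_pairs.head(10))
```

The upper-triangle mask avoids listing each pair twice (and the trivial self-correlations on the diagonal).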

Below is the correlation matrix heat map of the independent variables (image omitted):
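A heat map like the one referenced can be drawn from the correlation matrix with matplotlib alone; the DataFrame below is synthetic and the column names are illustrative:

```python
# Sketch: render a correlation matrix as a heat map (synthetic data).
import matplotlib
matplotlib.use("Agg")           # render off-screen, no display needed
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(300, 4)),
                  columns=["pdays", "pmonths", "euribor3m", "nr.employed"])
corr = df.corr()

fig, ax = plt.subplots()
im = ax.imshow(corr, vmin=-1, vmax=1, cmap="coolwarm")
ax.set_xticks(range(len(corr)))
ax.set_xticklabels(corr.columns, rotation=45, ha="right")
ax.set_yticks(range(len(corr)))
ax.set_yticklabels(corr.columns)
fig.colorbar(im, ax=ax, label="correlation")
fig.tight_layout()
fig.savefig("corr_heatmap.png")
```

Fixing `vmin=-1, vmax=1` keeps the color scale comparable across plots, so a cell's color always means the same correlation strength.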

Questions:

1) How do we decide which of two highly correlated variables to remove, based on the correlation value calculated between them?

Ex: the correlation value between pdays and pmonths is 1.000000. Which variable should be removed from the model, pdays or pmonths? How is that variable determined?

2) What correlation threshold range is typically used to drop a variable (e.g., > 0.65 or > 0.90)?

3) Can you please interpret the heat map above and explain which variables should be removed, and why?


Solution

  • You could try to use another selection criterion for choosing between each pair of highly correlated features. For example, you can use the Information Gain (IG), which measures how much information a feature gives about the class (i.e., its reduction of entropy [TAL14], [SIL07]). Once you have detected a pair of highly correlated features (e.g., pdays and pmonths, as you mentioned), you can measure the IG of each variable and keep the one with the highest IG. Nevertheless, there are other selection criteria that you could apply instead of IG (e.g., Mutual Information Maximisation [BHS15]).
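This tie-break can be sketched with scikit-learn's `mutual_info_classif` (an MI estimator, used here in place of a dedicated IG implementation). The data and the names `var_a`/`var_b` are synthetic: `var_b` is a noisy copy of `var_a`, and the class depends on `var_a`, so `var_a` should score higher and be kept.

```python
# Sketch: break a highly-correlated pair by keeping the feature with
# the higher mutual information with the class (synthetic data).
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(1)
x1 = rng.normal(size=500)
x2 = x1 + 0.3 * rng.normal(size=500)       # highly correlated with x1
y = (x1 > 0).astype(int)                   # class determined by x1

X = pd.DataFrame({"var_a": x1, "var_b": x2})
mi = pd.Series(mutual_info_classif(X, y, random_state=0), index=X.columns)

keep = mi.idxmax()                          # feature with the higher MI
drop = "var_b" if keep == "var_a" else "var_a"
print(f"keep {keep}, drop {drop}")
```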

    For the threshold, you can choose whatever value suits your problem. However, to play it safe I would select a high value (e.g., 0.95), although you could also consider values around 0.94 or 0.90. Moreover, you can always establish a high value first and then lower it gradually while checking the performance of your model.
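Threshold-based pruning can be sketched as below; the helper `drop_correlated` and the column names are illustrative, and the threshold is the 0.95 suggested above:

```python
# Sketch: drop one feature from every pair whose absolute correlation
# exceeds a chosen threshold (synthetic data).
import numpy as np
import pandas as pd

def drop_correlated(df: pd.DataFrame, threshold: float = 0.95) -> list[str]:
    """Return columns to drop so no remaining pair exceeds `threshold`."""
    corr = df.corr().abs()
    # Upper triangle only: each pair is checked once, later column is dropped.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    return [col for col in upper.columns if (upper[col] > threshold).any()]

rng = np.random.default_rng(7)
df = pd.DataFrame(rng.normal(size=(400, 3)), columns=["a", "b", "c"])
df["a_copy"] = df["a"] * 2.0                # |corr| = 1.0 with "a"
to_drop = drop_correlated(df, threshold=0.95)
print(to_drop)
```

Because only the upper triangle is scanned, the later column of each offending pair is the one flagged; combine this with an IG/MI score if you want a more principled choice of which member to keep.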

    [TAL14] Jiliang Tang, Salem Alelyani, and Huan Liu. Feature selection for classification: A review, pages 37–64. CRC Press, 2014.

    [SIL07] Yvan Saeys, Iñaki Inza, and Pedro Larrañaga. A review of feature selection techniques in bioinformatics. Bioinformatics, 23(19):2507–2517, 2007.

    [BHS15] Mohamed Bennasar, Yulia Hicks, and Rossitza Setchi. Feature selection using Joint Mutual Information Maximisation. Expert Systems with Applications, 42(22):8520–8532, 2015.