
Why do I get two different values in heatmap and feature_importances?


I'm running feature selection once using sns.heatmap and once using sklearn's feature_importances_.

When using the same data I get two different values.

Here is the heatmap:

[heatmap image]

and the heatmap code:

from matplotlib import pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns

training_data = pd.read_csv(
    "/Users/aus10/NFL/Data/Betting_Data/CBB/Training_Data_Betting_CBB.csv")

df_model = training_data.copy()
df_model = df_model.dropna()
# note: 'Money_Line_Percentage' only needs to be listed once
df_model = df_model.drop(['Money_Line', 'Money_Line_Percentage', 'Money_Line_Money', 'Money_Line_Move', 'Money_Line_Direction', 'Spread', 'Spread_Percentage', 'Spread_Money', 'Spread_Move', 'Spread_Direction',
                          'Win', 'Cover'], axis=1)

X = df_model.loc[:, ['Total', 'Total_Move', 'Over_Percentage', 'Over_Money',
                     'Under_Percentage', 'Under_Money']]  # independent columns
y = df_model['Over_Under']  # target column

# get correlations of all features (and the target) in the dataset
corrmat = df_model.corr()
plt.figure(figsize=(20, 20))
# plot heat map
g = sns.heatmap(corrmat, annot=True, cmap='hot')

plt.xticks(rotation=90)
plt.yticks(rotation=45)

plt.show()

Here is the feature_importances bar graph:

[bar graph image]

and the code:

import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
import matplotlib.pyplot as plt
from sklearn.model_selection import StratifiedKFold
from sklearn.inspection import permutation_importance

training_data = pd.read_csv(
    "/Users/aus10/NFL/Data/Betting_Data/CBB/Training_Data_Betting_CBB.csv", index_col=False)

df_model = training_data.copy()
df_model = df_model.dropna()

X = df_model.loc[:, ['Total', 'Total_Move', 'Over_Percentage', 'Over_Money',
                     'Under_Percentage', 'Under_Money']]  # independent columns
y = df_model['Over_Under']  # target column

model = RandomForestClassifier(
    random_state=1, n_estimators=100, min_samples_split=100, max_depth=5, min_samples_leaf=2)

skf = StratifiedKFold(n_splits=2)

# note: only the split from the final fold survives this loop, since
# X_train/X_test are overwritten on every iteration and the model is
# fitted once afterwards
for train_index, test_index in skf.split(X, y):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]

model.fit(X_train, y_train)
# built-in impurity-based feature importances of tree-based classifiers
print(model.feature_importances_)
# plot graph of feature importances for better visualization
feat_importances = pd.Series(model.feature_importances_, index=X.columns)
feat_importances.nlargest(5).plot(kind='barh')

# permutation importance on the held-out fold; importances_mean is the
# average drop in score when each feature is shuffled
perm_importance = permutation_importance(model, X_test, y_test)
print(pd.Series(perm_importance.importances_mean, index=X.columns))

plt.show()

I'm not sure which one is more accurate, or whether I'm using them in the correct way. Should I be using the heatmap to eliminate collinearity and the feature importances to actually select my group of features?


Solution

  • You are comparing two different things; why would you expect them to be the same? And what would it even mean in this case?

    Feature importances in tree-based models are computed from how a given feature is used for splitting. A feature that is used more often (and whose splits reduce impurity more) is more important (for a particular model fitted on a particular dataset) than a feature that is used less often.

    Correlation, on the other hand, is a measure of the linear relationship between two features.

    I'm not sure which one is more accurate

    What do you mean by accuracy? Both of these are accurate in what they measure; it is just that neither of them directly tells you which feature(s) to throw away.

    Note that just because two features are correlated, it doesn't mean you can automatically throw one of them away. Collinearity can cause issues with the interpretability of the model: if you have highly correlated features, you can't say which one is more important based on the weights associated with these features. Collinearity should not affect the predictive power of the model; more often, you will find that throwing away one of the correlated features decreases your model's predictive power.

    Collinearity in a dataset can therefore make the feature importances of your random forest model less interpretable, in the sense that you can't rely on their strict ordering. But again, it should not affect the predictive power of the model (except that the model is more prone to overfitting due to having more degrees of freedom).
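
    To make this concrete, here is a minimal sketch on synthetic data (all variable names here are hypothetical) showing that adding a near-duplicate of an informative feature leaves accuracy roughly unchanged while splitting that feature's importance between the two copies:

    import numpy as np
    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    rng = np.random.RandomState(0)
    n = 2000
    signal = rng.normal(size=n)  # informative feature
    y = (signal + 0.5 * rng.normal(size=n) > 0).astype(int)

    # model A: one informative feature plus pure noise
    X_a = pd.DataFrame({'signal': signal, 'noise': rng.normal(size=n)})
    # model B: same data plus a near-duplicate (highly correlated copy) of 'signal'
    X_b = X_a.assign(signal_copy=signal + 0.01 * rng.normal(size=n))

    for name, X_ in [('without duplicate', X_a), ('with duplicate', X_b)]:
        rf = RandomForestClassifier(n_estimators=200, random_state=0)
        acc = cross_val_score(rf, X_, y, cv=5).mean()
        rf.fit(X_, y)
        print(name, 'accuracy: %.3f' % acc,
              dict(zip(X_.columns, rf.feature_importances_.round(2))))
    # expected: accuracy stays roughly the same, but the importance of
    # 'signal' is split between 'signal' and 'signal_copy' in model B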

    Should I be using the heatmap to eliminate collinearity and the feature importances to actually select my group of features?

    Feature engineering/selection is more of an art than science (outside of end-to-end deep learning). There is no correct answer here and you will need to develop your own heuristics and try different things to see which one works better in which scenario.

    An example of a simple heuristic based on feature importances and correlation could be the following (assuming that you have a large number of features); a rough code sketch follows the list:

    1. fit the random forest model and measure the feature importances
    2. throw away those that seem to have no impact on the model (close to 0 importance)
    3. refit the model with the new subset of your original data and see whether the metric of your interest (accuracy, MSE, ...) stays approximately the same as in step 1.
    4. if you still have a lot of features, you can repeat steps 1-3, increasing the throw-away threshold until your metric of interest starts worsening
    5. measure the correlation of the features that you are left with and select the most correlated pairs (based on some threshold, e.g. |c| > 0.8)
    6. pick one pair; drop a feature from this pair; measure model performance; return the dropped feature; repeat for each pair
    7. drop the feature that seems to have the least negative effect on the model's performance based on the results from step 6.
    8. repeat steps 6-7 until the model's performance starts dropping
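
    A rough sketch of this heuristic (the thresholds, the tolerance, and all names here are illustrative choices, not prescriptive; it assumes X is a DataFrame of features and y the target Series):

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    def cv_score(feats):
        model = RandomForestClassifier(n_estimators=100, random_state=1)
        return cross_val_score(model, X[feats], y, cv=5).mean()

    features = list(X.columns)
    baseline = cv_score(features)
    tolerance = 0.005  # how much metric degradation we accept; a judgment call

    # steps 1-4: iteratively drop features with (near-)zero importance
    for threshold in (0.0, 0.01, 0.02, 0.05):
        model = RandomForestClassifier(n_estimators=100, random_state=1)
        model.fit(X[features], y)
        keep = [f for f, imp in zip(features, model.feature_importances_)
                if imp > threshold]
        if keep and cv_score(keep) >= baseline - tolerance:
            features = keep
        else:
            break

    # steps 5-8: prune highly correlated pairs one feature at a time
    while len(features) > 1:
        corr = X[features].corr().abs()
        pairs = [(a, b) for i, a in enumerate(features)
                 for b in features[i + 1:] if corr.loc[a, b] > 0.8]
        if not pairs:
            break
        # step 6: for every member of a correlated pair, score the model without it
        candidates = {f: cv_score([g for g in features if g != f])
                      for pair in pairs for f in pair}
        # step 7: drop the feature whose removal hurts performance the least
        best_drop, best_score = max(candidates.items(), key=lambda kv: kv[1])
        if best_score < baseline - tolerance:  # step 8: stop once performance drops
            break
        features.remove(best_drop)

    print('selected features:', features)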