I want to identify the 10 best features of a DataFrame using the Information Gain measure (mutual information in scikit-learn) and display them in a table, in ascending order of the Information Gain score.
In this example, features is the DataFrame that contains all the training data relevant to predicting whether a restaurant will close or not.
import numpy as np
import pandas as pd
from sklearn import preprocessing
from sklearn.model_selection import train_test_split

# Initialization of data and labels
x = features.copy()  # "x" contains all training data
y = x["closed"]  # "y" contains the labels of the records in "x"
# Drop the class column (closed) from the features
x = x.drop('closed', axis=1)
# this is x.columns, sorry for the mix of French and English
features_columns = ['moyenne_etoiles', 'ville', 'zone', 'nb_restaurants_zone',
'zone_categories_intersection', 'ville_categories_intersection',
'nb_restaurant_meme_annee', 'ecart_type_etoiles', 'tendance_etoiles',
'nb_avis', 'nb_avis_favorables', 'nb_avis_defavorables',
'ratio_avis_favorables', 'ratio_avis_defavorables',
'nb_avis_favorables_mention', 'nb_avis_defavorables_mention',
'nb_avis_favorables_elites', 'nb_avis_defavorables_elites',
'nb_conseils', 'nb_conseils_compliment', 'nb_conseils_elites',
'nb_checkin', 'moyenne_checkin', 'annual_std', 'chaine',
'nb_heures_ouverture_semaine', 'ouvert_samedi', 'ouvert_dimanche',
'ouvert_lundi', 'ouvert_vendredi', 'emporter', 'livraison',
'bon_pour_groupes', 'bon_pour_enfants', 'reservation', 'prix',
'terrasse']
# Normalization of the feature columns
std_scale = preprocessing.StandardScaler().fit(features[features_columns])
normalized_data = std_scale.transform(features[features_columns])
labels = np.array(features['closed'])

# Split the data into training and test sets
train_features, test_features, train_labels, test_labels = train_test_split(normalized_data, labels, test_size=0.2, random_state=42)
labels_true = ?
labels_pred = ?
# I don't really know how to use this function to achieve what I want
from sklearn.feature_selection import mutual_info_classif
from sklearn.datasets import make_classification
# Get the mutual information coefficients and convert them to a data frame
coeff_df = pd.DataFrame(features,
                        columns=['Coefficient'], index=x.columns)
coeff_df.head()
What is the correct syntax, using the mutual information score, to achieve this?
adjusted_mutual_info_score compares ground-truth labels with label predictions from a classifier. Both label arrays must have the same shape, (n_samples,), so it returns a single agreement score between two labelings rather than a score per feature.
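For illustration, a minimal sketch of what adjusted_mutual_info_score expects (the label arrays below are made up), which is not what you need here:

from sklearn.metrics import adjusted_mutual_info_score

# Hypothetical ground-truth labels and classifier predictions, both of shape (n_samples,)
labels_true = [0, 0, 1, 1, 0, 1]
labels_pred = [0, 0, 1, 1, 1, 1]

# Returns one agreement score between the two labelings, not per-feature information
adjusted_mutual_info_score(labels_true, labels_pred)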
You need Scikit-Learn's mutual_info_classif for what you are trying to achieve. Pass the array of features and the corresponding labels to mutual_info_classif to get back the estimated mutual information between each feature and the target.
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_classif
from sklearn.datasets import make_classification

# Generate a sample data frame
X, y = make_classification(n_samples=1000, n_features=4,
                           n_informative=2, n_redundant=2,
                           random_state=0, shuffle=False)
feature_columns = ['A', 'B', 'C', 'D']
features = pd.DataFrame(X, columns=feature_columns)

# Get the mutual information coefficients and convert them to a data frame
coeff_df = pd.DataFrame(mutual_info_classif(X, y).reshape(-1, 1),
                        columns=['Coefficient'], index=feature_columns)
Output
features.head(3)
Out[43]:
A B C D
0 -1.668532 -1.299013 0.799353 -1.559985
1 -2.972883 -1.088783 1.953804 -1.891656
2 -0.596141 -1.370070 -0.105818 -1.213570
# Displaying only the top two features. Adjust the number as required.
coeff_df.sort_values(by='Coefficient', ascending=False)[:2]
Out[44]:
Coefficient
B 0.523911
D 0.366884
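To get the 10 best features of your own data and display them in ascending order of their score, as asked in the question, something along these lines should work. This is a sketch that assumes the normalized_data, labels and features_columns variables from your preprocessing step:

# Mutual information between each (normalized) feature and the "closed" label
coeff_df = pd.DataFrame(mutual_info_classif(normalized_data, labels).reshape(-1, 1),
                        columns=['Coefficient'], index=features_columns)

# Keep the 10 highest-scoring features, then re-sort them in ascending order for display
top_10 = coeff_df.sort_values(by='Coefficient', ascending=False)[:10]
top_10.sort_values(by='Coefficient')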