Printing column/variable names after feature selection


I am trying feature selection on the Iris dataset.

I'm following the tutorial "Feature Selection with Univariate Statistical Tests".

I am using the lines below and want to find the significant features:

import pandas
from pandas import read_csv
from numpy import set_printoptions
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_classif

dataframe = pandas.read_csv("C:\\dateset\\iris.csv")
array = dataframe.values
X = array[:,0:4]
Y = array[:,4]

test = SelectKBest(score_func=f_classif, k=2)
fit = test.fit(X, Y)

set_printoptions(precision=2)
arr = fit.scores_

print (arr)

# [ 119.26   47.36 1179.03  959.32]

To show the indexes of the top 2 features by score, I added:

idx = (-arr).argsort()[:2]
print (idx)

# [2 3]

How can I get the column/variable names instead of their indexes?


Solution

  • Index dataframe.columns with idx. This works directly here because X is built from the first 4 columns, so positions in X match positions in the DataFrame:

    #first 4 columns
    X = array[:,0:4]
    
    cols = dataframe.columns[idx]
    

    If X is built from a different range of columns, first select the same columns from the DataFrame by position, then index with idx:

    #e.g. selected 3rd to 6th column (positions 2 to 5)
    X = array[:,2:6]
    
    cols = dataframe.iloc[:, 2:6].columns[idx]
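
To see the offset case end to end, here is a minimal sketch. It assumes a small made-up DataFrame with an extra leading "id" column (the column names and values are hypothetical, not the real iris.csv) and reuses the scores printed in the question in place of fit.scores_:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for iris.csv: an id column, four features, one label.
dataframe = pd.DataFrame(
    {
        "id": [1, 2, 3],
        "sepal_length": [5.1, 7.0, 6.3],
        "sepal_width": [3.5, 3.2, 3.3],
        "petal_length": [1.4, 4.7, 6.0],
        "petal_width": [0.2, 1.4, 2.5],
        "species": ["setosa", "versicolor", "virginica"],
    }
)

# The features sit at positions 1..4, so X does not start at column 0.
feature_slice = slice(1, 5)
X = dataframe.values[:, feature_slice]

# Scores copied from the question's output, standing in for fit.scores_.
scores = np.array([119.26, 47.36, 1179.03, 959.32])

# Positions of the two highest scores, relative to the columns of X.
idx = (-scores).argsort()[:2]

# Apply the same slice to the DataFrame before indexing with idx,
# otherwise the positions would be shifted by the leading id column.
cols = dataframe.iloc[:, feature_slice].columns[idx]
top_features = list(cols)
print(top_features)
```

Keeping the slice in one variable and using it for both X and the column lookup avoids the two ranges drifting apart when the feature selection changes.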