I am trying feature selection on the Iris dataset.
I'm following the tutorial "Feature Selection with Univariate Statistical Tests".
I am using the code below and want to find out which features are significant:
import pandas
from pandas import read_csv
from numpy import set_printoptions
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_classif
dataframe = pandas.read_csv("C:\\dateset\\iris.csv")
array = dataframe.values
X = array[:,0:4]
Y = array[:,4]
test = SelectKBest(score_func=f_classif, k=2)
fit = test.fit(X, Y)
set_printoptions(precision=2)
arr = fit.scores_
print (arr)
# [ 119.26 47.36 1179.03 959.32]
To show the indices of the top 2 features by score, I added:
idx = (-arr).argsort()[:2]
print (idx)
# [2 3]
How can I also get the column/variable names (instead of their indices)?
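Index the DataFrame's columns with idx. Using dataframe.columns directly works here because X was built from the first 4 columns, so the positions in idx line up with the column positions: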
#first 4 columns
X = array[:,0:4]
cols = dataframe.columns[idx]
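To illustrate the indexing above without the CSV file, here is a minimal sketch that reuses the scores printed in the question. The column names are hypothetical (the real header of iris.csv may differ); only the positional lookup is the point.

```python
import numpy as np
import pandas as pd

# Hypothetical column names; the actual iris.csv header may differ.
columns = pd.Index(
    ["sepal_length", "sepal_width", "petal_length", "petal_width", "species"]
)

# Scores as printed in the question.
scores = np.array([119.26, 47.36, 1179.03, 959.32])

# Positions of the two highest scores, best first.
idx = np.argsort(-scores)[:2]

# X came from the first four columns, so idx maps straight onto them.
top_features = columns[idx].tolist()
print(top_features)  # ['petal_length', 'petal_width']
```

The names come back ordered by score, highest first, because idx is sorted that way.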
If X is built from a different range of columns, slice the DataFrame by the same positions first, so that idx refers to the right columns:
# e.g. the 3rd to 6th columns
X = array[:,2:6]
cols = dataframe.iloc[:, 2:6].columns[idx]
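As a possible alternative to sorting the scores by hand, the fitted selector can report which columns it kept through get_support. The sketch below is not taken from the question: it assumes scikit-learn's bundled copy of Iris (load_iris) in place of the CSV file, so the column names and exact scores may differ slightly from the ones shown above.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

# Assumption: use scikit-learn's bundled Iris data instead of iris.csv.
iris = load_iris()
features = pd.DataFrame(iris.data, columns=iris.feature_names)

selector = SelectKBest(score_func=f_classif, k=2)
selector.fit(features.values, iris.target)

# Integer positions of the kept features, in column order (not score order).
kept = selector.get_support(indices=True)

# Map positions back to names on the same frame that X was built from.
selected = features.columns[kept].tolist()
print(selected)
```

Note that get_support returns the positions in ascending column order, so use the argsort approach when the ranking by score matters.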