I am trying to use scikit-learn's recursive feature elimination with cross-validation (RFECV) on a dataset of shape (5000, 37)
for a binary classification problem, and whenever I fit the model the algorithm seems to enter an infinite loop.
I am following this example on how to use the algorithm: https://scikit-learn.org/stable/auto_examples/feature_selection/plot_rfe_with_cross_validation.html
My data is:
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import StratifiedKFold
from sklearn.feature_selection import RFECV
X = np.random.randint(0,363175645.191632,size=(5000, 37))
Y = np.random.choice([0, 1], size=(5000,))  # one label per row of X
This is how I tried to select the features:
svc = SVC(kernel="linear")
rfecv = RFECV(estimator=svc, step=1, cv=StratifiedKFold(2),
scoring='accuracy')
rfecv.fit(X, Y)
The code hangs and appears to loop forever. However, when I try another estimator such as ExtraTreesClassifier, it works just fine. What is going on?
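An SVM is distance-based, so it makes sense to scale your feature variables before fitting, especially when the values are as large as in your case. On unscaled data of this magnitude the solver can take extremely long to converge, which looks like an infinite loop. You can also check out this intro to SVM. Using an example dataset: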
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt  # provides plt for the scatter plot
import numpy as np
from sklearn.preprocessing import StandardScaler
Scaler = StandardScaler()
X, y = make_blobs(n_samples=5000, centers=3, shuffle=False,random_state=42)
X = np.concatenate((X,np.random.randint(0,363175645.191632,size=(5000,35))),axis=1)
y = (y==1).astype('int')
X_scaled = Scaler.fit_transform(X)
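To make the effect of scaling concrete, here is a minimal sketch on a hypothetical two-column matrix (the names `X_demo` and `X_demo_scaled` and the seed are assumptions for illustration, not part of the example above). It only shows that StandardScaler brings a small-valued and a huge-valued column onto the same scale:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility

# Hypothetical toy matrix: one small-valued column next to one huge-valued column.
X_demo = np.column_stack([
    rng.normal(size=1000),
    rng.integers(0, 363175645, size=1000),
])

# Standardize each column to zero mean and unit variance.
X_demo_scaled = StandardScaler().fit_transform(X_demo)

print("std before scaling:", X_demo.std(axis=0))
print("std after scaling: ", X_demo_scaled.std(axis=0))
```

Before scaling, the second column's spread is many orders of magnitude larger than the first, so it dominates any distance computation. After scaling, both columns have a standard deviation of about 1.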
This dataset has only 2 useful variables in the first two columns, as you can see from the plot:
plt.scatter(x=X_scaled[:,0],y=X_scaled[:,1],c=['k' if i else 'b' for i in y])
Now we run RFECV on the scaled data, and we can see that it ranks the first two columns as the top variables:
from sklearn.svm import SVC
from sklearn.model_selection import StratifiedKFold
from sklearn.feature_selection import RFECV
svc = SVC(kernel="linear")
rfecv = RFECV(estimator=svc, step=1, cv=StratifiedKFold(2),scoring='accuracy')
rfecv.fit(X_scaled, y)
rfecv.ranking_
array([ 1, 2, 17, 28, 33, 22, 23, 26, 6, 19, 20, 4, 10, 25, 3, 27, 11,
8, 18, 5, 29, 14, 7, 21, 9, 13, 24, 30, 35, 31, 32, 34, 16, 36,
37, 12, 15])
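In ranking_, a value of 1 marks a feature that RFECV kept, and larger values mark features that were eliminated earlier. As a rough sketch of how to read the result, the snippet below repeats the same setup on a smaller, hypothetical dataset (300 samples, 2 informative columns plus 8 large noise columns; the names `X_small`, `y_small` and `selector` are assumptions made so the example runs quickly). Which features end up selected may differ from the output above:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility

# Smaller, hypothetical variant: 2 informative columns + 8 large noise columns.
X_small, y_small = make_blobs(n_samples=300, centers=3, shuffle=False, random_state=42)
X_small = np.concatenate([X_small, rng.integers(0, 363175645, size=(300, 8))], axis=1)
y_small = (y_small == 1).astype(int)

# Scale first, so the linear SVM converges in reasonable time.
X_small_scaled = StandardScaler().fit_transform(X_small)

selector = RFECV(estimator=SVC(kernel="linear"), step=1,
                 cv=StratifiedKFold(2), scoring="accuracy")
selector.fit(X_small_scaled, y_small)

# support_ is a boolean mask of kept features; ranking_ == 1 marks the same set.
selected = np.flatnonzero(selector.support_)
print("number of features kept:", selector.n_features_)
print("indices of kept features:", selected)
print("ranking:", selector.ranking_)
```

Here support_ and n_features_ give the selected subset directly, so there is no need to decode ranking_ by hand.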