Tags: python, python-3.x, machine-learning, scikit-learn, feature-selection

Code enters infinite loop when trying to select features


I am trying to use scikit-learn's Recursive Feature Elimination with Cross-Validation (RFECV) on a (5000, 37) dataset with a binary class label, and whenever I fit the model the algorithm appears to enter an infinite loop. I am following this example on how to employ the algorithm: https://scikit-learn.org/stable/auto_examples/feature_selection/plot_rfe_with_cross_validation.html

My data is:

    import numpy as np
    from sklearn.svm import SVC
    from sklearn.model_selection import StratifiedKFold
    from sklearn.feature_selection import RFECV

    X = np.random.randint(0, 363175645, size=(5000, 37))
    Y = np.random.choice([0, 1], size=(5000,))

This is what I tried to select the features:

    svc = SVC(kernel="linear")
    rfecv = RFECV(estimator=svc, step=1, cv=StratifiedKFold(2),
                  scoring='accuracy')
    
    rfecv.fit(X, Y)

The code hangs as if in an infinite loop. However, when I use another estimator such as ExtraTreesClassifier it works just fine. What is going on?


Solution

  • When you use an SVM, because it is distance based, it makes sense to scale your feature variables, especially in your case where they are huge. You can also check out this intro to SVM. Using an example dataset:

    from sklearn.datasets import make_blobs
    from sklearn.preprocessing import StandardScaler
    import matplotlib.pyplot as plt
    import numpy as np

    scaler = StandardScaler()

    X, y = make_blobs(n_samples=5000, centers=3, shuffle=False, random_state=42)
    X = np.concatenate((X, np.random.randint(0, 363175645, size=(5000, 35))), axis=1)
    y = (y == 1).astype('int')

    X_scaled = scaler.fit_transform(X)
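To see why scaling matters here, we can compare the column standard deviations before and after scaling. This is a quick check using the same construction as above (with a seeded generator for reproducibility): the 35 noise columns are on the order of 1e8 while the two informative columns are of order one, so without scaling the SVM's distance computations are dominated by noise.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X, y = make_blobs(n_samples=5000, centers=3, shuffle=False, random_state=42)
X = np.concatenate((X, rng.integers(0, 363175645, size=(5000, 35))), axis=1)

# Raw scales: the first two (informative) columns are small,
# the noise columns are roughly 363175645 / sqrt(12) ~ 1e8.
print(X.std(axis=0)[:4])

# After StandardScaler every column has unit variance.
X_scaled = StandardScaler().fit_transform(X)
print(X_scaled.std(axis=0)[:4])
```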
    

    This dataset has only 2 useful variables in the first two columns, as you can see from the plot:

    plt.scatter(x=X_scaled[:,0],y=X_scaled[:,1],c=['k' if i else 'b' for i in y])
    

    (Scatter plot of the first two scaled columns, colored by class: two well-separated groups.)

    Now we run RFECV on the scaled data, and we can see it ranks the first two columns as the top variables:

    from sklearn.svm import SVC
    from sklearn.model_selection import StratifiedKFold
    from sklearn.feature_selection import RFECV
    
    svc = SVC(kernel="linear")
    rfecv = RFECV(estimator=svc, step=1, cv=StratifiedKFold(2), scoring='accuracy')
    rfecv.fit(X_scaled, y)
    
    rfecv.ranking_
    
    array([ 1,  2, 17, 28, 33, 22, 23, 26,  6, 19, 20,  4, 10, 25,  3, 27, 11,
            8, 18,  5, 29, 14,  7, 21,  9, 13, 24, 30, 35, 31, 32, 34, 16, 36,
           37, 12, 15])
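Besides `ranking_`, the fitted selector exposes `support_` (a boolean mask over columns) and `n_features_` (how many were kept), which you can use to reduce the design matrix directly. A smaller sketch of the same setup (fewer samples and noise columns, so it runs in seconds):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Same construction as above, shrunk down: 2 informative columns + 8 noise columns.
rng = np.random.default_rng(0)
X, y = make_blobs(n_samples=300, centers=3, shuffle=False, random_state=42)
X = np.concatenate((X, rng.integers(0, 363175645, size=(300, 8))), axis=1)
y = (y == 1).astype('int')
X_scaled = StandardScaler().fit_transform(X)

rfecv = RFECV(estimator=SVC(kernel="linear"), step=1,
              cv=StratifiedKFold(2), scoring='accuracy')
rfecv.fit(X_scaled, y)

print(rfecv.n_features_)               # how many columns RFECV kept
print(np.flatnonzero(rfecv.support_))  # indices of the kept columns
X_selected = X_scaled[:, rfecv.support_]  # reduced design matrix
```

Note that `X_selected` keeps only the columns the cross-validated search retained; RFECV is free to keep fewer than two columns if one informative column already achieves the best CV score.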