Tags: python, classification, random-forest, pca, feature-selection

Optimal Feature Selection Technique after PCA?


I'm implementing a classification task with a binary outcome using RandomForestClassifier, and I know the importance of data preprocessing for improving the accuracy score. In particular, my dataset contains more than 100 features and almost 4000 instances, and I want to perform a dimensionality reduction technique in order to avoid overfitting, since there is a high presence of noise in the data.

For these tasks I usually use classical feature selection methods (filters, wrappers, feature importances), but I recently read about combining Principal Component Analysis (PCA) as a first step and then applying feature selection on the transformed dataset.

My question is the following: is there a specific feature selection method that I should use after having performed PCA on my data? In particular, what I want to understand is whether using PCA on my data makes some particular feature selection techniques useless or less effective.


Solution

  • Let's begin with: when should we use PCA?

    PCA is most useful when you are not sure which components of your data affect the accuracy.

    Let's think about the face-recognition task. Can we tell which pixels are the most crucial at a glance?

    For instance: the Olivetti faces dataset. 40 people, a dark homogeneous background, varying lighting, facial expressions (open/closed eyes, smiling/not smiling), and facial details (glasses/no glasses).

    So if we look at the correlations between the pixels:

    from sklearn.datasets import fetch_olivetti_faces
    from numpy import corrcoef
    from numpy import zeros_like
    from numpy import triu_indices_from
    from matplotlib.pyplot import figure
    from matplotlib.pyplot import get_cmap
    from matplotlib.pyplot import plot
    from matplotlib.pyplot import colorbar
    from matplotlib.pyplot import subplots
    from matplotlib.pyplot import suptitle
    from matplotlib.pyplot import imshow
    from matplotlib.pyplot import xlabel
    from matplotlib.pyplot import ylabel
    from matplotlib.pyplot import savefig
    from matplotlib.image import imread
    import seaborn
    
    
    olivetti = fetch_olivetti_faces()
    
    X = olivetti.images  # Face images, shape (400, 64, 64)
    y = olivetti.target  # Labels (person id, 0-39)
    
    # Flatten each 64x64 image into a 4096-dimensional feature vector
    X = X.reshape((X.shape[0], X.shape[1] * X.shape[2]))
    
    seaborn.set(font_scale=1.2)
    seaborn.set_style("darkgrid")
    
    # Mask the upper triangle so only half of the symmetric correlation matrix is drawn
    mask = zeros_like(corrcoef(X))
    mask[triu_indices_from(mask)] = True
    with seaborn.axes_style("white"):
        f, ax = subplots(figsize=(20, 15))
        ax = seaborn.heatmap(corrcoef(X), 
                             annot=True, 
                             mask=mask, 
                             vmax=1,
                             vmin=0,
                             square=True, 
                             cmap="YlGnBu",
                             annot_kws={"size": 1})
        
    savefig('heatmap.png')
    

    [Correlation heatmap of the flattened Olivetti faces data]

    From the above, can you tell which pixels are the most important for the classification?

    However, if I ask you, "Could you please tell me the most important features for chronic kidney disease?"

    You can tell me at a glance:

    [Chronic kidney disease feature plot]

    Returning to the face-recognition task: do we really need all the pixels for the classification?

    No, we don't.

    [A face rendered with only 63 pixels]

    Above you can see that only 63 pixels suffice for a human to recognize an image as a face.

    Please note that those 63 pixels suffice to recognize something as a face, not for face recognition. You need more pixels to discriminate between faces.

    So what we do is reduce the dimensionality. You might want to read more about the curse of dimensionality.

    OK, so we decide to use PCA, since we don't need every pixel of the face image; we have to reduce the dimension.

    To make it visually understandable, I'm using 2 dimensions.

    from sklearn.decomposition import PCA
    
    
    def projection(obj, x, x_label, y_label, title, class_num=40, sample_num=10, dpi=300):
        # Project the data with the fitted transformer and scatter-plot the first two components
        x_obj = obj.transform(x)
        idx_range = class_num * sample_num
        fig = figure(figsize=(6, 3), dpi=dpi)
        ax = fig.add_subplot(1, 1, 1)
        c_map = get_cmap(name='jet', lut=class_num)
        scatter = ax.scatter(x_obj[:idx_range, 0], x_obj[:idx_range, 1], c=y[:idx_range],
                             s=10, cmap=c_map)
        ax.set_xlabel(x_label)
        ax.set_ylabel(y_label)
        ax.set_title(title.format(class_num))
        colorbar(mappable=scatter)
        
    
    
    pca_obj = PCA(n_components=2).fit(X)
    x_label = "First Principal Component"
    y_label = "Second Principal Component"
    title = "PCA Projection of {} people"
    projection(obj=pca_obj, x=X, x_label=x_label, y_label=y_label, title=title)
    

    [2-D PCA projection of the 40 classes]

    As you can see, PCA with 2 components isn't sufficient to discriminate.

    So how many components do you need?

    def display_n_components(obj):
        # Scree plot: explained variance of each principal component
        figure(1, figsize=(6, 3), dpi=300)
        plot(obj.explained_variance_, linewidth=2)
        xlabel('Components')
        ylabel('Explained Variance')
    
    
    pca_obj2 = PCA().fit(X)
    display_n_components(pca_obj2)
    

    [Explained variance per principal component]

    You need 100 components for good discrimination.
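
    Rather than reading the number off the scree plot, you can also let the cumulative explained variance ratio pick it for you. A minimal sketch (the 95% variance threshold here is just an example value) could look like this:

    from numpy import cumsum, argmax

    # Cumulative share of variance explained by the first k components
    cum_var = cumsum(pca_obj2.explained_variance_ratio_)

    # Smallest k that keeps (for example) 95% of the variance
    n_components_95 = argmax(cum_var >= 0.95) + 1
    print("Components for 95% variance:", n_components_95)

    # scikit-learn can also do this directly: a float n_components
    # is interpreted as the fraction of variance to keep
    pca_95 = PCA(n_components=0.95).fit(X)
    print("PCA chose", pca_95.n_components_, "components")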

    Now we need to split the data into train and test sets.

    from sklearn.model_selection import train_test_split
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.decomposition import PCA
    from sklearn.metrics import accuracy_score
    
    
    # X is already flattened to shape (n_samples, 4096), so no further reshape is needed
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
    
    # Fit PCA on the training set only, then apply the same transformation to both sets
    pca = PCA(n_components=100).fit(X_train)
    X_pca_tr = pca.transform(X_train)
    X_pca_te = pca.transform(X_test)
    
    forest1 = RandomForestClassifier(random_state=42)
    forest1.fit(X_pca_tr, y_train)
    y_pred = forest1.predict(X_pca_te)
    print("\nAccuracy: {:,.2f}%".format(accuracy_score(y_true=y_test, y_pred=y_pred) * 100))
    

    The accuracy will be:

    [Accuracy output with PCA]

    You might wonder, does PCA improve the accuracy?

    The answer is Yes.

    Without PCA: [accuracy output without PCA]
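
    If you want to reproduce that comparison yourself, a minimal sketch along the following lines trains the same RandomForestClassifier with and without a 100-component PCA step, wrapping PCA in a Pipeline so that it is refit on the training folds only; the exact scores will vary with the data split and random state.

    from sklearn.pipeline import make_pipeline
    from sklearn.model_selection import cross_val_score

    # Same classifier, with and without a 100-component PCA step in front
    rf_plain = RandomForestClassifier(random_state=42)
    rf_pca = make_pipeline(PCA(n_components=100), RandomForestClassifier(random_state=42))

    # 5-fold cross-validated accuracy; the PCA inside the pipeline is fitted per training fold
    score_plain = cross_val_score(rf_plain, X, y, cv=5).mean()
    score_pca = cross_val_score(rf_pca, X, y, cv=5).mean()

    print("Accuracy without PCA: {:.2f}%".format(score_plain * 100))
    print("Accuracy with PCA   : {:.2f}%".format(score_pca * 100))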