Tags: python, classification, random-forest, pca, feature-selection

Optimal Feature Selection Technique after PCA?


I'm implementing a classification task with a binary outcome using RandomForestClassifier, and I know the importance of data preprocessing for improving the accuracy score. In particular, my dataset contains more than 100 features and almost 4000 instances, and I want to perform a dimensionality reduction technique in order to avoid overfitting, since there is a high presence of noise in the data.

For these tasks I usually use classical feature selection methods (filters, wrappers, feature importances), but I recently read about combining Principal Component Analysis (PCA) as a first step and then applying feature selection on the transformed dataset.

My question is the following: is there a specific feature selection method that I should use after having performed PCA on my data? In particular, what I want to understand is whether using PCA on my data makes some particular feature selection techniques useless or less effective.


Solution

  • Let's begin with: when should we use PCA?

    PCA is most useful when you are not sure which components of your data affect the accuracy.

    Let's think about the face-recognition task. Can we tell which pixels are the most crucial at a glance?

    For instance: the Olivetti faces dataset. 40 people, a dark homogeneous background, varying lighting, facial expressions (open/closed eyes, smiling/not smiling), and facial details (glasses/no glasses).

    So if we look at the correlations between the pixels:

    from sklearn.datasets import fetch_olivetti_faces
    from numpy import corrcoef
    from numpy import zeros_like
    from numpy import triu_indices_from
    from matplotlib.pyplot import figure
    from matplotlib.pyplot import get_cmap
    from matplotlib.pyplot import plot
    from matplotlib.pyplot import colorbar
    from matplotlib.pyplot import subplots
    from matplotlib.pyplot import suptitle
    from matplotlib.pyplot import imshow
    from matplotlib.pyplot import xlabel
    from matplotlib.pyplot import ylabel
    from matplotlib.pyplot import savefig
    from matplotlib.image import imread
    import seaborn
    
    
    olivetti = fetch_olivetti_faces()
    
    X = olivetti.images  # Face images, shape (400, 64, 64)
    y = olivetti.target  # Labels (person id, 0-39)
    
    # Flatten each 64x64 image into a 4096-dimensional feature vector
    X = X.reshape((X.shape[0], X.shape[1] * X.shape[2]))
    
    seaborn.set(font_scale=1.2)
    seaborn.set_style("darkgrid")
    
    # Mask the upper triangle so only half of the symmetric correlation matrix is drawn
    mask = zeros_like(corrcoef(X))
    mask[triu_indices_from(mask)] = True
    with seaborn.axes_style("white"):
        f, ax = subplots(figsize=(20, 15))
        ax = seaborn.heatmap(corrcoef(X), 
                             annot=True, 
                             mask=mask, 
                             vmax=1,
                             vmin=0,
                             square=True, 
                             cmap="YlGnBu",
                             annot_kws={"size": 1})
        
    savefig('heatmap.png')
    

    [Correlation heatmap of the flattened Olivetti faces data]

    From the above, can you tell which pixels are the most important for the classification?

    However, if I ask you, "Could you please tell me the most important features for chronic kidney disease?"

    You can tell me at a glance:

    [Chronic kidney disease feature plot]

    Returning to the face-recognition task: do we really need all the pixels for the classification?

    No, we don't.

    [A face rendered with only 63 pixels]

    Above you can see that only 63 pixels suffice for a human to recognize an image as a face.

    Please note that those 63 pixels suffice to recognize something as a face, not for face recognition. You need more pixels to discriminate between faces.

    So what we do is reduce the dimensionality. You might want to read more about the curse of dimensionality.

    OK, so we decide to use PCA, since we don't need every pixel of the face image; we have to reduce the dimension.

    To make it visually understandable, I'm using 2 dimensions.

    from sklearn.decomposition import PCA
    
    
    def projection(obj, x, x_label, y_label, title, class_num=40, sample_num=10, dpi=300):
        # Project the data with the fitted transformer and scatter-plot the first two components
        x_obj = obj.transform(x)
        idx_range = class_num * sample_num
        fig = figure(figsize=(6, 3), dpi=dpi)
        ax = fig.add_subplot(1, 1, 1)
        c_map = get_cmap(name='jet', lut=class_num)
        scatter = ax.scatter(x_obj[:idx_range, 0], x_obj[:idx_range, 1], c=y[:idx_range],
                             s=10, cmap=c_map)
        ax.set_xlabel(x_label)
        ax.set_ylabel(y_label)
        ax.set_title(title.format(class_num))
        colorbar(mappable=scatter)
        
    
    
    pca_obj = PCA(n_components=2).fit(X)
    x_label = "First Principal Component"
    y_label = "Second Principal Component"
    title = "PCA Projection of {} people"
    projection(obj=pca_obj, x=X, x_label=x_label, y_label=y_label, title=title)
    

    [2-D PCA projection of the 40 classes]

    As you can see, PCA with 2 components isn't sufficient to discriminate.

    So how many components do you need?

    def display_n_components(obj):
        # Scree plot: explained variance of each principal component
        figure(1, figsize=(6, 3), dpi=300)
        plot(obj.explained_variance_, linewidth=2)
        xlabel('Components')
        ylabel('Explained Variance')
    
    
    pca_obj2 = PCA().fit(X)
    display_n_components(pca_obj2)
    

    [Explained variance per principal component]

    You need 100 components for good discrimination.
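
    Rather than reading the number off the scree plot, you can also let the cumulative explained variance ratio pick it for you. A minimal sketch (the 95% variance threshold here is just an example value) could look like this:

    from numpy import cumsum, argmax

    # Cumulative share of variance explained by the first k components
    cum_var = cumsum(pca_obj2.explained_variance_ratio_)

    # Smallest k that keeps (for example) 95% of the variance
    n_components_95 = argmax(cum_var >= 0.95) + 1
    print("Components for 95% variance:", n_components_95)

    # scikit-learn can also do this directly: a float n_components
    # is interpreted as the fraction of variance to keep
    pca_95 = PCA(n_components=0.95).fit(X)
    print("PCA chose", pca_95.n_components_, "components")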

    Now we need to split the data into train and test sets.

    from sklearn.model_selection import train_test_split
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.decomposition import PCA
    from sklearn.metrics import accuracy_score
    
    
    # X is already flattened to shape (n_samples, 4096), so no further reshape is needed
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
    
    # Fit PCA on the training set only, then apply the same transformation to both sets
    pca = PCA(n_components=100).fit(X_train)
    X_pca_tr = pca.transform(X_train)
    X_pca_te = pca.transform(X_test)
    
    forest1 = RandomForestClassifier(random_state=42)
    forest1.fit(X_pca_tr, y_train)
    y_pred = forest1.predict(X_pca_te)
    print("\nAccuracy: {:,.2f}%".format(accuracy_score(y_true=y_test, y_pred=y_pred) * 100))
    

    The accuracy will be:

    [Accuracy output with PCA]

    You might wonder, does PCA improve the accuracy?

    The answer is Yes.

    Without PCA: [accuracy output without PCA]
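
    If you want to reproduce that comparison yourself, a minimal sketch along the following lines trains the same RandomForestClassifier with and without a 100-component PCA step, wrapping PCA in a Pipeline so that it is refit on the training folds only; the exact scores will vary with the data split and random state.

    from sklearn.pipeline import make_pipeline
    from sklearn.model_selection import cross_val_score

    # Same classifier, with and without a 100-component PCA step in front
    rf_plain = RandomForestClassifier(random_state=42)
    rf_pca = make_pipeline(PCA(n_components=100), RandomForestClassifier(random_state=42))

    # 5-fold cross-validated accuracy; the PCA inside the pipeline is fitted per training fold
    score_plain = cross_val_score(rf_plain, X, y, cv=5).mean()
    score_pca = cross_val_score(rf_pca, X, y, cv=5).mean()

    print("Accuracy without PCA: {:.2f}%".format(score_plain * 100))
    print("Accuracy with PCA   : {:.2f}%".format(score_pca * 100))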