
Correct procedure for PCA feature selection then Random Forest classification


After MFCC feature extraction, I am attempting to use PCA for feature selection and then carry out classification with a Random Forest.

Prior to applying StandardScaler() to the data, I have already split it into X_train, y_train, X_test and y_test.

Step 1: First, I scale the data as follows:

import numpy as np
from sklearn.preprocessing import StandardScaler

# Fit the scaler on the training data only, then apply it to both sets
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train.reshape(-1, X_train.shape[-1])).reshape(X_train.shape)
X_test_scaled = scaler.transform(X_test.reshape(-1, X_test.shape[-1])).reshape(X_test.shape)

# Flatten each 2D feature matrix into a 1D vector for PCA
X_train_scaled = np.array([features_2d.flatten() for features_2d in X_train_scaled])
X_test_scaled = np.array([features_2d.flatten() for features_2d in X_test_scaled])

Step 2: Then I apply PCA and fit it as follows:

pca_train = PCA().fit(X_train_scaled)
pca_train = PCA(n_components = index_95)      # Transformation into 31 principal components
x_pca_train = pca_train.fit_transform(X_train_scaled)

x_pca_test = pca_test.fit_transform(X_test_scaled)

X_train = x_pca_train
X_test = x_pca_test

Step 3: Carry out Random Forest classification.
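
For reference, Step 3 is roughly the following (a minimal sketch; the hyperparameters are just placeholders):

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Train a Random Forest on the PCA-transformed training features
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)

# Evaluate on the PCA-transformed test features
y_pred = clf.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))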

I wanted to know whether Step 1 and Step 2 are the correct way to apply StandardScaler and PCA to the X_train and X_test data.

Thanks for your time and help!


Solution

  • First of all, PCA is not guaranteed to be useful for classification tasks (see e.g. https://www.csd.uwo.ca/~oveksler/Courses/CS434a_541a/Lecture8.pdf).

    I cannot say whether all the reshapes in the scaler step are needed without knowing your data; however, Step 2 certainly looks a bit off:

    • Your first call, pca_train = PCA().fit(X_train_scaled), is redundant, since you immediately overwrite pca_train on the next line and the fitted object is discarded.
    • x_pca_test = pca_test.fit_transform(X_test_scaled) looks like a mistake: pca_test is never defined, and in any case you should fit PCA only on the training data and then apply transform() to the test set (see the sketch below).
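
A minimal sketch of how Step 2 could look instead (assuming index_95 is meant to be the number of components that explains 95% of the variance; adjust as needed):

import numpy as np
from sklearn.decomposition import PCA

# Fit a full PCA on the training data only to inspect the explained variance
pca_full = PCA().fit(X_train_scaled)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)
index_95 = int(np.argmax(cumulative >= 0.95)) + 1   # components needed for 95% variance

# Refit with the chosen number of components, again on the training data only
pca = PCA(n_components=index_95)
x_pca_train = pca.fit_transform(X_train_scaled)

# The test set is only transformed, never fitted
x_pca_test = pca.transform(X_test_scaled)

Note that scikit-learn's PCA also accepts a float for n_components (e.g. PCA(n_components=0.95)), which keeps just enough components to explain that fraction of the variance, so the manual index_95 computation may not be needed at all.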