After MFCC feature extraction, I am attempting to use PCA for feature selection and then carry out classification with a Random Forest.
Prior to applying StandardScaler() to the data, I have split it into X_train, y_train, X_test and y_test.
Step 1: First, I scale the data as follows:
import numpy as np
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train.reshape(-1, X_train.shape[-1])).reshape(X_train.shape)  # collapse to 2D, fit and scale, restore original shape
X_test_scaled = scaler.transform(X_test.reshape(-1, X_test.shape[-1])).reshape(X_test.shape)  # reuse the training statistics on the test data
# Flatten data for PCA
X_train_scaled = np.array([features_2d.flatten() for features_2d in X_train_scaled])
X_test_scaled = np.array([features_2d.flatten() for features_2d in X_test_scaled])
Step 2: Then I apply PCA and PCA.fit as follows:
from sklearn.decomposition import PCA

pca_train = PCA().fit(X_train_scaled)
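# (Sketch, not my exact code: index_95 is meant to be the smallest number of
#  components whose cumulative explained variance reaches ~95%.)
cumulative_variance = np.cumsum(pca_train.explained_variance_ratio_)
index_95 = int(np.argmax(cumulative_variance >= 0.95)) + 1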
pca_train = PCA(n_components = index_95) # Transformation into 31 principal components
x_pca_train = pca_train.fit_transform(X_train_scaled)
x_pca_test = pca_test.fit_transform(X_test_scaled)
X_train = x_pca_train
X_test = x_pca_test
Step 3: Carry out Random Forest classification.
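(For completeness, Step 3 is just a standard Random Forest fit on the PCA-transformed features; a minimal sketch with placeholder hyperparameters, not my tuned values, is below.)
from sklearn.ensemble import RandomForestClassifier
# Minimal sketch of Step 3 -- n_estimators and random_state are placeholders
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)           # X_train now holds the PCA-transformed training features
print(rf.score(X_test, y_test))    # accuracy on the PCA-transformed test features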
I wanted to know whether the procedure in Step 1 and Step 2 is correct for applying StandardScaler and PCA to the X_train and X_test data.
Thanks for your time and help!
First of all, PCA is not guaranteed to be useful for classification tasks (see e.g. https://www.csd.uwo.ca/~oveksler/Courses/CS434a_541a/Lecture8.pdf).
I cannot say whether all the reshapes in the scaler step are needed without knowing your data; however, Step 2 certainly looks a bit off:
pca_train = PCA().fit(X_train_scaled)
is redundant, since you immediately redefine pca_train afterwards. And
x_pca_test = pca_test.fit_transform(X_test_scaled)
looks like a mistake: you should fit only on the training data and then apply transform() to the test set.
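Assuming the shapes from your snippet, a corrected Step 2 would look roughly like this (a sketch reusing your variable names, not a drop-in replacement):
from sklearn.decomposition import PCA
# Fit the PCA once, on the scaled training data only
pca = PCA(n_components=index_95)   # or PCA(n_components=0.95) to pick the 95%-variance cutoff automatically
x_pca_train = pca.fit_transform(X_train_scaled)
# Reuse the fitted PCA to project the test data -- no second fit
x_pca_test = pca.transform(X_test_scaled)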