Tags: python, machine-learning, random-forest, mnist

MNIST - huge difference between validation and test score


I have a task to recognize MNIST digits without using a neural network. Accuracy should be >= 0.7. I've fit a random forest and got 0.88 on validation, but test accuracy is only 0.2. Such a big difference is completely unclear to me.

Here is my code:

import numpy as np
import plotly.express as px
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, RandomizedSearchCV

train = np.loadtxt('./data/digit/train.csv', delimiter=',', skiprows=1)
test = np.loadtxt('./data/digit/test.csv', delimiter=',', skiprows=1)
# creating variable for labels
train_label = train[:, 0]
# changing shape
train_img = np.resize(train[:, 1:], (train.shape[0], 28, 28))
test_img = np.resize(test, (test.shape[0], 28, 28))

# flatten the images into feature vectors
X_train = train_img.reshape(-1, 28 * 28).astype(np.float32)
X_test = test_img.reshape(-1, 28 * 28).astype(np.float32)
# and center them by subtracting the mean
X_mean = X_train.mean(axis=0)
X__mean = X_test.mean(axis=0)
X_train -= X_mean
X_test -= X__mean

# calculate the covariance matrix and compute its SVD for the train set
cov = np.dot(X_train.T, X_train) / X_train.shape[0]
U, S, V = np.linalg.svd(cov)
# and for the test set
test_cov = np.dot(X_test.T, X_test) / X_test.shape[0]
U_, S_, V_ = np.linalg.svd(test_cov)

# visualize and analyze the PCA explained variance
S_cumsum = np.cumsum(S) / np.sum(S)
px.line(S_cumsum)

# reduce the feature dimensionality
S_thr = 0.8  
n_comp = np.argmax(np.where(S_cumsum > S_thr, 1, 0))
X_train = np.dot(X_train, U[:, :n_comp])
X_test = np.dot(X_test, U_[:, :n_comp])

# split the training data into train and validation sets
x_train, x_val, y_train, y_val = train_test_split(X_train, train_label, test_size=0.2, random_state=42)

#fit with RandomizedSearchCV
rf_params = {'criterion': ['gini', 'entropy'], 'n_estimators': range(50, 150, 10), 'max_depth': range(3,10), 'max_features': np.arange(0.3, 1.0, 0.1), 'min_samples_leaf': range(2, 5)}
rf_model = RandomizedSearchCV(RandomForestClassifier(), rf_params, n_iter = 15, cv = 7, random_state = 42)
rf_model.fit(x_train, y_train)

rf_model.best_score_  # 0.8750297619047619
# validation accuracy: 0.8751190476190476

pred = rf_model.predict(X_test)  # ... and test accuracy is only 0.2

I also tried logistic regression, but got the same result.


Solution

  • It looks like the main issue is in how you're handling the PCA transformation: you compute and apply PCA separately for the train and test sets. The singular vectors of the test covariance matrix generally differ from those of the training covariance (even in sign and order), so the two projections land in inconsistent reduced-dimensional representations, and a model trained in one space cannot make sense of features from the other.

    You should fit the PCA using only the training data and then apply the same transformation to both the training and test data. Try this:

    # Calculate the covariance matrix and compute its SVD for train
    cov = np.dot(X_train.T, X_train) / X_train.shape[0]
    U, S, V = np.linalg.svd(cov)
    
    # Visualize and analyze PCA
    S_cumsum = np.cumsum(S) / np.sum(S)
    S_thr = 0.8
    n_comp = np.argmax(np.where(S_cumsum > S_thr, 1, 0))
    
    # Apply the same U to both training and test sets
    X_train = np.dot(X_train, U[:, :n_comp])
    X_test = np.dot(X_test, U[:, :n_comp]) # Use U, not U_
    

    After applying this change, your validation and test scores should be much more consistent. For full consistency, also center X_test with the training mean X_mean rather than the test set's own mean. A cross-check with scikit-learn's built-in PCA is sketched below.
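
    As a side note, scikit-learn's PCA does this bookkeeping for you: fit learns the mean and components from the training data only, and transform reuses them for any later data. Below is a minimal sketch, not your original code, assuming X_train, X_test, and train_label are the raw, un-projected arrays from the question; PCA(n_components=0.8) keeps as many components as needed to explain 80% of the variance, mirroring your S_thr = 0.8 threshold:

    from sklearn.decomposition import PCA
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import Pipeline

    x_train, x_val, y_train, y_val = train_test_split(
        X_train, train_label, test_size=0.2, random_state=42)

    # fit learns the mean and principal components from x_train only;
    # every later transform (validation, test) reuses them.
    pipe = Pipeline([
        ('pca', PCA(n_components=0.8)),              # keep ~80% of the variance
        ('rf', RandomForestClassifier(random_state=42)),
    ])
    pipe.fit(x_train, y_train)
    print('validation accuracy:', pipe.score(x_val, y_val))
    pred = pipe.predict(X_test)                      # projected with the training PCA

    Putting PCA and the classifier in one Pipeline also keeps a RandomizedSearchCV leak-free, since each CV fold refits the PCA on its own training split.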