I'm studying machine learning and NLP in Python by recreating the common "predict spam messages" project. I did all the preliminary steps of cleanup and preprocessing until I got a TF-IDF document-term matrix of 2,000 terms. I then performed SVD to reduce it to 300 terms (or components) and scaled the results so that I could run a quick logistic classifier to get a benchmark for later models.
Later in the project, while building random forests, I realized I had forgotten to comment out the scaler below and was building the forests with the scaled SVD, which is totally unnecessary. What I did not expect is that this made the random forests slower than with the unscaled SVD (5297 s versus 1348 s for the grid search) and, worse, lowered sensitivity by about 10 percentage points.
Can anyone help me understand why this is so?
Here are results of the grid search with the best (highest sensitivity) unscaled SVD:
Elapsed: 1348 s
Best params: {'max_depth': 20, 'max_features': 250, 'min_samples_split': 10, 'n_estimators': 200}
Confusion matrix on validation set:
      pred_neg  pred_pos
neg        844         2
pos          5       124
Evaluation metrics:
accuracy: 0.9928
sensitivity: 0.9612
specificity: 0.9976
Here are the results of the grid search with the best (highest sensitivity) scaled SVD:
Elapsed: 5297 s
Best params: {'max_depth': 5, 'max_features': 250, 'min_samples_split': 5, 'n_estimators': 200}
Confusion matrix on validation set:
      pred_neg  pred_pos
neg        838         8
pos         18       111
Evaluation metrics:
accuracy: 0.9733
sensitivity: 0.8605
specificity: 0.9905
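For reference, the metrics above can be recomputed from the two confusion matrices. This is only a minimal sketch: the helper name `metrics_from_confusion` is made up for illustration, and the counts are copied from the output shown above.

```python
def metrics_from_confusion(tn, fp, fn, tp):
    """Return accuracy, sensitivity (recall on positives) and specificity."""
    total = tn + fp + fn + tp
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }

# Counts copied from the confusion matrices above
unscaled = metrics_from_confusion(tn=844, fp=2, fn=5, tp=124)
scaled = metrics_from_confusion(tn=838, fp=8, fn=18, tp=111)

for name, result in [("unscaled", unscaled), ("scaled", scaled)]:
    print(name, {k: round(v, 4) for k, v in result.items()})

# Gap in sensitivity, expressed in percentage points
gap_points = 100 * (unscaled["sensitivity"] - scaled["sensitivity"])
print(round(gap_points, 1))
```

The gap comes from 13 additional missed spam messages out of 129 (18 false negatives instead of 5).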
Here's the culprit:
from scipy.sparse.linalg import svds
from sklearn.utils.extmath import svd_flip
from sklearn.preprocessing import MaxAbsScaler
def perform_SVD(X, n_components=300):
    # transpose to a term-document matrix
    U, Sigma, VT = svds(X.asfptype().T,
                        k=n_components)
    # reverse outputs
    Sigma = Sigma[::-1]
    U, VT = svd_flip(U[:, ::-1], VT[::-1])

    # transpose to get V
    V = VT.T

    # scale for logistic classifier only
    # can't take log of negative numbers
    # ends up predicting ham base rate
    # comment out for random forests!
    scaler = MaxAbsScaler()
    X_scaled = scaler.fit_transform(V)

    return X_scaled
This is not really surprising. Imagine that you are interested in classification and your dataset consists of two fairly distinct clusters, like this:
If you fit a decision tree to this, it can easily find the decision boundary, so your classification accuracy will be quite good.
Now imagine that you scale the data first. The new dataset will look like this:
As you can see, the two classes overlap much more, so it is harder for the model to find a decision boundary.
When you scale the data, you bring the two clusters closer to each other, which can make them less distinguishable.
At this point you might wonder why we bother rescaling at all, since this effect shows up regardless of the model you use. That is true, and scaling can make the data less distinguishable, but in models like neural networks, skipping it has other downsides: the weights of one feature can be artificially inflated, gradients may not flow properly, and so on. In that case, the advantages of scaling can outweigh the disadvantages, and you can still end up with a good model.
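To make the point about gradients concrete, here is a minimal sketch. It assumes a plain logistic loss as a stand-in for the output layer of a neural net, synthetic data, and max-abs scaling done by hand with NumPy; `logistic_gradient` and the variable names are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
y = rng.integers(0, 2, size=n)  # binary labels

# Two equally informative features; the second is on a 1000x larger scale
x_small = y + 0.5 * rng.standard_normal(n)
x_large = 1000.0 * (y + 0.5 * rng.standard_normal(n))
X = np.column_stack([x_small, x_large])

def logistic_gradient(X, y, w):
    """Gradient of the mean logistic loss with respect to the weights w."""
    p = 1.0 / (1.0 + np.exp(-X @ w))
    return X.T @ (p - y) / len(y)

w0 = np.zeros(X.shape[1])
grad_raw = logistic_gradient(X, y, w0)

# Max-abs scaling by hand: divide each column by its largest absolute value
X_scaled = X / np.abs(X).max(axis=0)
grad_scaled = logistic_gradient(X_scaled, y, w0)

# How much larger the second gradient component is than the first
ratio_raw = abs(grad_raw[1] / grad_raw[0])
ratio_scaled = abs(grad_scaled[1] / grad_scaled[0])
print(ratio_raw, ratio_scaled)
```

On the raw features the gradient component of the large-scale feature dominates by roughly the same factor as the scale difference, so a single learning rate suits one weight and not the other. After scaling, the two components are of comparable size.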
As for the difference in speed, the same effect is likely at work: with the same parameters, the random forest probably has to search longer to get a good fit in the scaled case.
Here is the code used to produce the plots:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import MaxAbsScaler

# value on the covariance diagonal (a variance, despite the name)
std = 7
# two Gaussian clusters, one per class
X = np.random.multivariate_normal([2, 2], [[std, 0], [0, std]], size=100)
Y = np.random.multivariate_normal([10, 10], [[std, 0], [0, std]], size=100)

# first plot: raw data
plt.scatter(X[:, 0], X[:, 1])
plt.scatter(Y[:, 0], Y[:, 1])
plt.show()

# second plot: each cluster is rescaled by its own max absolute value,
# because fit_transform refits the scaler on every call
scaler = MaxAbsScaler()
X_scaled = scaler.fit_transform(X)
Y_scaled = scaler.fit_transform(Y)
plt.scatter(X_scaled[:, 0], X_scaled[:, 1])
plt.scatter(Y_scaled[:, 0], Y_scaled[:, 1])
plt.show()
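The plotting code calls `fit_transform` once per class, whereas `perform_SVD` in the question calls it once on the whole matrix `V`. The sketch below puts numbers on that difference. It is only an illustration under stated assumptions: max-abs scaling is done by hand with NumPy rather than with `MaxAbsScaler`, the data is synthetic, and `max_abs_scale` and `separation` are made-up helper names.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = np.sqrt(7)  # the covariance diagonal in the plotting code is a variance of 7
a = rng.normal(loc=2, scale=sigma, size=(100, 2))   # class centred at (2, 2)
b = rng.normal(loc=10, scale=sigma, size=(100, 2))  # class centred at (10, 10)

def max_abs_scale(fit_on, data):
    """Divide each column of data by the largest absolute value in fit_on."""
    return data / np.abs(fit_on).max(axis=0)

def separation(a, b):
    """Gap between class means on feature 0, in units of the pooled std."""
    pooled_std = np.sqrt((a[:, 0].var() + b[:, 0].var()) / 2)
    return abs(a[:, 0].mean() - b[:, 0].mean()) / pooled_std

raw = separation(a, b)

# One scale per class, as in the plotting code
per_class = separation(max_abs_scale(a, a), max_abs_scale(b, b))

# One scale for all rows, as when a single scaler is fit on the full matrix
both = np.vstack([a, b])
joint = separation(max_abs_scale(both, a), max_abs_scale(both, b))

print(f"raw={raw:.2f} per_class={per_class:.2f} joint={joint:.2f}")
```

With a single shared scale the separation is unchanged, because every column is divided by one positive constant. The shrinking gap appears only when each class is divided by its own maximum, so it is worth checking which of the two situations applies before attributing a drop in performance to scaling.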