Tags: python, scikit-learn, random-forest, pca, mnist

Why does performance suffer when fitting a Random Forest model after reducing with PCA?


This question is about comparing the fitting speed of a Random Forest classifier on the full set of features versus one fitted on a reduced number of components obtained with PCA. I'm using the MNIST dataset, which has 60,000 rows for my training set (X_train), 10,000 for my test set (X_test), and 784 features, each of which is a pixel of the 28x28 image.

For the full set of features, I am measuring the time it takes to fit using time.perf_counter() like so:

import time
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier()
t0 = time.perf_counter()
clf.fit(X_train, y_train)              # fit on all 784 pixel features
runtime = time.perf_counter() - t0

For PCA followed by the Random Forest, I am doing something similar:

from sklearn.decomposition import PCA

pca = PCA(n_components=0.95)           # keep enough components to explain 95% of the variance
t0 = time.perf_counter()
components = pca.fit_transform(X_train)
clf.fit(components, y_train)           # fit on the reduced feature set
runtime = time.perf_counter() - t0

For the full set, I get a runtime of ~6 seconds, while for the PCA pipeline I get a runtime of ~27 seconds. Even if I split out just the runtime of fitting (removing the time it takes to do the PCA), I still consistently get approximately 6 seconds versus 14 seconds. The number of features for the full set is 784, while PCA reduces that to 154 components. My limited understanding is that, at the very least, fitting the model should be quicker with PCA because of the reduced number of features, so why is it not?
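
Concretely, the split-out timing I describe above looks roughly like this (just a sketch; X_train and y_train are the arrays described above, and the exact numbers vary between runs):

import time

from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier

pca = PCA(n_components=0.95)

t0 = time.perf_counter()
components = pca.fit_transform(X_train)   # reduce 784 pixels to ~154 components
pca_time = time.perf_counter() - t0

clf = RandomForestClassifier()
t0 = time.perf_counter()
clf.fit(components, y_train)              # fit only, PCA time excluded
fit_time = time.perf_counter() - t0

print(f"PCA: {pca_time:.1f}s, forest fit on {components.shape[1]} components: {fit_time:.1f}s")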

I've tried scaling before PCA and tuning hyperparameters, among other things, but the counter-intuitive difference in runtime is pretty consistent, and I believe there's just something I'm not understanding conceptually.


Solution

  • Difference in features

    You said that you originally have 784 features and reduce them to 154. That may seem like a big reduction. However, if you look at the documentation:

    max_features : int, float, string or None, optional (default=”auto”)

    The number of features to consider when looking for the best split:

    • If “auto”, then max_features=sqrt(n_features).

    That means that your original problem considered sqrt(784) = 28 features at each split, and you reduced that to sqrt(154) ≈ 12.

    Yes, it is smaller now, but not as small as you originally thought.
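
    A quick way to check this, and to compare fits with the per-split budget held fixed, is something like the following sketch (clf_full and clf_pca are illustrative names; X_train, y_train and components are the arrays from the question):

    import math

    from sklearn.ensemble import RandomForestClassifier

    print(math.isqrt(784), math.isqrt(154))   # 28 vs. 12 features considered per split

    clf_full = RandomForestClassifier(max_features=28)   # same per-split budget in both runs
    clf_pca = RandomForestClassifier(max_features=28)
    clf_full.fit(X_train, y_train)       # 784 raw pixel features
    clf_pca.fit(components, y_train)     # 154 PCA components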

  • Optimization

    The way your Random Forest is built is by looking at possible splits and picking the best ones according to a certain criterion. Note the documentation:

    criterion : string, optional (default=”gini”)

    The function to measure the quality of a split. Supported criteria are “gini” for the Gini impurity and “entropy” for the information gain. Note: this parameter is tree-specific.

    [...]

    Note: the search for a split does not stop until at least one valid partition of the node samples is found, even if it requires to effectively inspect more than max_features features.

    So, while fitting, the algorithm iterates over possible splits that optimize the criterion. However, by reducing the number of features you may have made it harder to find good splits (there are fewer good splits available), which makes the algorithm need more iterations to find one.
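
    If you want to see how much splitting work each forest actually did, you can count the nodes of the fitted trees (a sketch, reusing the clf_full and clf_pca forests from the earlier sketch; more nodes means more split searches were performed during fitting):

    def total_nodes(forest):
        # each fitted tree exposes its node count via the tree_ attribute
        return sum(tree.tree_.node_count for tree in forest.estimators_)

    print("nodes, full features:", total_nodes(clf_full))
    print("nodes, PCA components:", total_nodes(clf_pca))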