python machine-learning scikit-learn

Sklearn StackingClassifier very slow and inconsistent cpu usage


I've been trying out the StackingClassifier and StackingRegressor from sklearn recently, but I've noticed they are always very slow and use my CPU inefficiently. Let's say (just for the sake of this example) that I want to use StackingClassifier to stack a random forest and LightGBM, also using LightGBM as the final classifier. In this case I would expect the StackingClassifier's run time to be roughly the time for one individual random forest, plus the time for two individual LightGBM runs, plus a small margin for the stacking itself. In practice, however, it seems to take several times as long. Example:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import StackingClassifier
import lightgbm as ltb
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import StratifiedKFold

X,y = load_iris(return_X_y=True)
cv = StratifiedKFold(n_splits=10)
lgbm = ltb.LGBMClassifier(n_jobs=4)
rf = RandomForestClassifier()

First just LightGBM. Going by wall time, this takes about 140ms on my computer:

%%time
scores = cross_val_score(lgbm, X, y, scoring='accuracy', cv=cv, n_jobs=4, error_score='raise')
np.mean(scores)

And just a random forest. This takes about 220ms for me:

%%time
scores = cross_val_score(rf, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')
np.mean(scores)

And now a StackingClassifier that combines these two. Since it basically runs the above two blocks of code plus another round of LightGBM, I would expect it to take roughly 220+140+140=500ms, but instead it takes about 3000ms, about 6x as long:

%%time
estimators = [
    ('rf', rf),
    ('lgbm', lgbm),
]

clf = StackingClassifier(
    estimators=estimators, final_estimator=lgbm, passthrough=True)

scores = cross_val_score(clf, X, y, scoring='accuracy', cv=cv, n_jobs=4, error_score='raise')
np.mean(scores)    

I've also noticed (when running this exact same code on a bigger dataset so it takes long enough for me to be able to monitor my cpu usage) that the cpu usage with the StackingClassifier is all over the place.

For example, CPU usage when running the individual LightGBM:

[image: cpu usage running the individual lightgbm]

(basically consistently 100%, so using the CPU efficiently)

And CPU usage when running LightGBM inside the StackingClassifier:

[image: cpu usage running lightgbm as stackingclassifier]

(all over the place, usually nowhere near 100%)

Am I doing something wrong that is causing StackingClassifier to be this much slower than the sum of the parts?


Solution

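  • scikit-learn's StackingClassifier runs its own internal cross-validation. Each base estimator is fitted on the full training data, and the final estimator is trained on out-of-fold predictions of the base estimators obtained with cross_val_predict. The number of internal folds is set by the cv parameter and defaults to 5 (see the StackingClassifier documentation).

    In your code you call the StackingClassifier inside cross_val_score, so this internal cross_val_predict is executed once for every outer fold. With 10 outer folds and 5 internal folds, each base estimator is fitted 10 × (5 + 1) = 60 times instead of 10, and the final estimator another 10 times.

    That is about 6x the work of the standalone runs, which may explain the wall time difference.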
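
The fit counts implied by this explanation can be sketched as plain arithmetic. The snippet below is only a rough model, not a profile of scikit-learn itself. The helper names `stacking_fit_counts` and `estimated_wall_time_ms` are hypothetical. It assumes every fit of a given estimator costs about the same, and it ignores prediction time and parallelism effects.

```python
# Rough fit-count model for a StackingClassifier evaluated with cross_val_score.
# Assumption: per outer fold, every base estimator is fitted once per internal
# fold plus once on the full training split, and the final estimator once.

def stacking_fit_counts(outer_folds, inner_folds, n_base_estimators):
    """Return (base_fits, final_fits) over the whole outer cross-validation."""
    fits_per_base_per_outer_fold = inner_folds + 1  # internal CV + full fit
    base_fits = outer_folds * n_base_estimators * fits_per_base_per_outer_fold
    final_fits = outer_folds
    return base_fits, final_fits


def estimated_wall_time_ms(standalone_ms, inner_folds, final_ms):
    """Scale standalone cross_val_score timings by the extra fits.

    standalone_ms: timings of cross_val_score on each base estimator alone
    final_ms: timing of cross_val_score on the final estimator alone
    """
    return sum(standalone_ms) * (inner_folds + 1) + final_ms


# Setup from the question: 10 outer folds, default 5 internal folds,
# two base estimators (random forest and LightGBM).
base_fits, final_fits = stacking_fit_counts(
    outer_folds=10, inner_folds=5, n_base_estimators=2
)
print(base_fits, final_fits)  # 120 10

# Timings from the question: random forest ~220 ms, LightGBM ~140 ms.
estimate = estimated_wall_time_ms([220, 140], inner_folds=5, final_ms=140)
print(estimate)  # 2300
```

Under these assumptions the estimate is about 2300ms, the same order of magnitude as the observed 3000ms. If the extra fits are the bottleneck, the cv parameter of StackingClassifier lowers the internal fold count, at the cost of training the final estimator on predictions from fewer folds.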