Tags: python, machine-learning, scikit-learn, svm

Scikit-learn BaggingRegressor with SVR fast to train but slow to predict


I see a number of questions about SVM speed, but nothing about the difference between training and prediction. Here is the code for the model in question:

from sklearn.svm import SVR
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.compose import TransformedTargetRegressor 
from sklearn.ensemble import BaggingRegressor

svr = SVR(C=1e-1, epsilon=0.1, tol=1e0)
pipeline = Pipeline([('scaler', StandardScaler()), ('model', svr)])
model = TransformedTargetRegressor(regressor=pipeline, transformer=StandardScaler())
model = BaggingRegressor(base_estimator=model, n_estimators=20, max_samples=1/20, n_jobs=-1)  # 20 estimators, each trained on a 5% subsample

The above trains on close to 500,000 samples with 50 features in well under 2 minutes, but takes more than 20 minutes to predict half as many samples. As a side note, training the TransformedTargetRegressor without bagging took close to 10 hours, and its predictions took several hours. So not only is training much faster than prediction with bagging, but bagging also speeds up training far more than it speeds up prediction.

Is there anything that can be done about this? Or at the least, is there something specific about SVM/SVR models that might be causing it?
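For context, here is a quick check on a smaller synthetic dataset (illustrative numbers only, using `make_regression` rather than the real data) showing how many training samples a fitted SVR retains as support vectors, since kernel-SVR prediction cost grows with that count:

```python
from sklearn.datasets import make_regression
from sklearn.svm import SVR

# Illustrative check: kernel-SVR predict cost is roughly
# O(n_support_vectors * n_test_samples), so a high support-vector
# count can explain slow prediction even when training is fast.
X, y = make_regression(n_samples=3000, n_features=50, random_state=0)

svr = SVR(C=1e-1, epsilon=0.1, tol=1e0).fit(X, y)
print(f"support vectors: {len(svr.support_)} of {len(X)} training samples")
```

With an epsilon that is small relative to the target scale, a large fraction of the training samples typically end up as support vectors.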


Solution

  • You trained each SVM on much less data than you're using for inference (each estimator sees only 1/20 × 500K ≈ 25K samples), and an RBF SVM scales poorly for both training and inference (though differently). If you want to keep an RBF SVM, you may want to use a faster implementation such as the one in cuML (requires an NVIDIA GPU) (disclaimer: I work on this project).

    I get the following performance with a random 500K x 20 dataset on my machine [0].

    import cuml
    
    from sklearn.preprocessing import StandardScaler
    from sklearn.pipeline import Pipeline
    from sklearn.compose import TransformedTargetRegressor 
    from sklearn.ensemble import BaggingRegressor
    from sklearn.datasets import make_regression
    
    X, y = make_regression(n_samples=500000, n_features=20)
    
    svr = cuml.svm.SVR(C=1e-1, epsilon=0.1, tol=1e0)
    pipeline = Pipeline([('scaler', StandardScaler()), ('model', svr)])
    model = TransformedTargetRegressor(regressor=pipeline, transformer=StandardScaler())
    model = BaggingRegressor(base_estimator=model, n_estimators=20, max_samples=1/20)
    
    %time model.fit(X,y)
    %time preds = model.predict(X)
    
    CPU times: user 1.58 s, sys: 156 ms, total: 1.73 s
    Wall time: 1.73 s
    CPU times: user 7.23 s, sys: 485 ms, total: 7.72 s
    Wall time: 7.73 s
    
    

    [0] System

    • CPU: Intel(R) Xeon(R) Gold 6128 CPU @ 3.40GHz, CPU(s): 12
    • GPU: Quadro RTX 8000
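
If an NVIDIA GPU is not available, one CPU-only option worth noting (a hedged sketch, not part of the answer above) is to approximate the RBF kernel with scikit-learn's Nystroem feature map and fit a LinearSVR on top; prediction then costs a single matrix product instead of a sum over support vectors:

```python
from sklearn.datasets import make_regression
from sklearn.kernel_approximation import Nystroem
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVR

X, y = make_regression(n_samples=5000, n_features=20, random_state=0)

# Approximate the RBF kernel with a 300-dimensional Nystroem feature map,
# then fit a linear SVR; predict time no longer grows with the number of
# support vectors. n_components trades accuracy for speed.
model = make_pipeline(
    StandardScaler(),
    Nystroem(kernel="rbf", n_components=300, random_state=0),
    LinearSVR(C=1e-1, epsilon=0.1, max_iter=5000),
)
model.fit(X, y)
preds = model.predict(X)
```

Accuracy will differ from an exact RBF SVR, so this is only a reasonable trade when approximate results are acceptable.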