I have the following toy example to replicate the issue:
import numpy as np
import xgboost as xgb
from sklearn.datasets import make_regression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
X, y = make_regression(n_samples=30, n_features=5, noise=0.2)
reg = xgb.XGBRegressor(tree_method='hist', eval_metric='mae', n_jobs=4)
steps = list()
steps.append(('reg', reg))
pipeline = Pipeline(steps=steps)
param_grid = {'reg__max_depth': [2, 4, 6],}
cv = 3
model = GridSearchCV(pipeline, param_grid, cv=cv, scoring='neg_mean_absolute_error')
best_model = model.fit(X=X, y=y)
Then the following four methods fail to save the fitted model:
model.save_model('test_1.json')
# AttributeError: 'GridSearchCV' object has no attribute 'save_model'
best_model.save_model('test2.json')
# AttributeError: 'GridSearchCV' object has no attribute 'save_model'
best_model.best_estimator_.save_model('test3.json')
# AttributeError: 'Pipeline' object has no attribute 'save_model'
model.best_estimator_.save_model('test4.json')
# AttributeError: 'Pipeline' object has no attribute 'save_model'
But these two methods work:
import joblib
joblib.dump(model.best_estimator_, 'naive_model.joblib')
joblib.dump(best_model.best_estimator_, 'naive_best_model.joblib')
Can anyone tell me whether the way I constructed my pipeline is what breaks the method for saving the best model?
Only the "xgboost" estimator itself (here XGBRegressor) has a "save_model" method. When you use grid search, the result is a different object (GridSearchCV) wrapped around the "xgboost" estimator, and the same goes for pipelines, so neither of them exposes that method. You will need to call model.best_estimator_['reg'].save_model instead.
But that will save only the xgboost model, without any of the data transformation steps from the pipeline.
"joblib" and "pickle" are more universal solutions, imho, since they serialize the whole fitted pipeline rather than just the xgboost step.