I want to calibrate an xgboost model that is already trained. According to the documentation of the cv parameter of CalibratedClassifierCV:
If “prefit” is passed, it is assumed that base_estimator has been fitted already and all data is used for calibration.
So I tried to use it as follows:
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier
from sklearn.calibration import CalibratedClassifierCV
X, y = make_classification()
X = pd.DataFrame(X)
X.columns = ['var' + str(i) for i in range(1, 21)]
y = pd.Series(y)
X_train, X_test, y_train, y_test = train_test_split(X, y)
model = XGBClassifier()
model.fit(X_train, y_train)
calibrated = CalibratedClassifierCV(model, method='isotonic', cv='prefit')
calibrated.fit(X_test, y_test)
Unfortunately, this resulted in the following error:
ValueError: feature_names mismatch: ['var1', 'var2', 'var3', 'var4', 'var5', 'var6', 'var7', 'var8', 'var9', 'var10', 'var11', 'var12', 'var13', 'var14', 'var15', 'var16', 'var17', 'var18', 'var19', 'var20'] ['f0', 'f1', 'f2', 'f3', 'f4', 'f5', 'f6', 'f7', 'f8', 'f9', 'f10', 'f11', 'f12', 'f13', 'f14', 'f15', 'f16', 'f17', 'f18', 'f19'] expected var12, var10, var3, var1, var20, var15, var2, var9, var16, var7, var17, var11, var8, var5, var13, var4, var14, var6, var19, var18 in input data training data did not have the following fields: f2, f5, f16, f17, f13, f11, f18, f6, f9, f1, f12, f10, f19, f15, f14, f3, f7, f0, f4, f8
I believe this may be because features are stored within the xgboost object under default names such as f0, f1, etc. Therefore, I tried to rename the X_test columns using X_test.rename(lambda x: x.replace('var', 'f'), axis = 1), but it doesn't solve the issue. So my question is: how can I fix this error and use CalibratedClassifierCV on a trained xgboost model?
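Pandas causes the problem. The model is trained on a DataFrame with column names (var1 to var20), but as the error message shows, the data that reaches the model during calibration carries only the default names (f0 to f19), so xgboost reports a feature_names mismatch. Renaming the columns of X_test does not help, because the names are lost before the data reaches the model.
Use X_train, X_test, y_train, y_test = train_test_split(X.values, y.values) and everything will work fine.
Passing numpy arrays to both the model and CalibratedClassifierCV means no column names are stored at training time, so there is nothing left to mismatch.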
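To see the conversion step in isolation, here is a minimal sketch using the same synthetic data as the question. The random_state arguments are my own addition to make the split reproducible and are not part of the original code. It only checks that the split returns plain arrays without column names. It does not fit any model.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Same synthetic setup as in the question (100 rows, 20 features by default).
# random_state is an assumption added here for reproducibility.
X, y = make_classification(random_state=0)
X = pd.DataFrame(X, columns=['var' + str(i) for i in range(1, 21)])
y = pd.Series(y)

# .values returns the underlying numpy arrays, which carry no column names.
X_train, X_test, y_train, y_test = train_test_split(
    X.values, y.values, random_state=0
)

print(type(X_train).__name__, X_train.shape, X_test.shape)
print(hasattr(X_train, 'columns'))
```

Since neither array has a columns attribute, a model fitted on X_train stores no feature names, and the calibration data in X_test is in the same form.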
Full code:
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier
from sklearn.calibration import CalibratedClassifierCV
X, y = make_classification()
X = pd.DataFrame(X)
X.columns = ['var' + str(i) for i in range(1, 21)]
y = pd.Series(y)
X_train, X_test, y_train, y_test = train_test_split(X.values, y.values)
model = XGBClassifier()
model.fit(X_train, y_train)
calibrated = CalibratedClassifierCV(model, method='isotonic', cv='prefit')
calibrated.fit(X_test, y_test)