Beginning in xgboost version 1.6, you can now run multioutput models directly. In the past, I had been using the scikit learn wrapper MultiOutputRegressor around an xgbregressor estimator. I could then access the individual models feature importance by using something thing like wrapper.estimators_[i].feature_importances_
Now, however, when I run feature_importances_ on a multioutput model of xgboostregressor, I only get one set of features even through I have more than one target. Any idea what this array of feature importances actually represents? Is it the first, last, or some sort of average across all the targets? Is this a function that perhaps just not ready to handle multioutput?
*Realize questions are always easier to answer when you have some code to test:
import numpy as np
from sklearn import datasets
import xgboost as xgb
from sklearn.multioutput import MultiOutputRegressor
from numpy.testing import assert_almost_equal
n_targets = 3
X, y = datasets.make_regression(n_targets=n_targets)
X_train, y_train = X[:50], y[:50]
X_test, y_test = X[50:], y[50:]
single_run_features = {}
references = np.zeros_like(y_test)
for n in range(0,n_targets):
xgb_indi_model = xgb.XGBRegressor(random_state=0)
xgb_indi_model.fit(X_train, y_train[:, n])
references[:,n] = xgb_indi_model.predict(X_test)
single_run_features[n] = xgb_indi_model.feature_importances_
xgb_multi_model = xgb.XGBRegressor(random_state=0)
xgb_multi_model.fit(X_train, y_train)
y__multi_pred = xgb_multi_model.predict(X_test)
xgb_scikit_model = MultiOutputRegressor(xgb.XGBRegressor(random_state=0))
xgb_scikit_model.fit(X_train, y_train)
y_pred = xgb_scikit_model.predict(X_test)
print(assert_almost_equal(references, y_pred))
print(assert_almost_equal(y__multi_pred, y_pred))
scikit_features = {}
for i in range(0,n_targets):
scikit_features[i] = xgb_scikit_model.estimators_[i].feature_importances_
xgb_multi_model_features = xgb_multi_model.feature_importances_
single_run_features
scikit_features
The features importances match for the loop version of single target model, single_run_features, and the MultiOutputRegressor version, scikit_features. The issues is the results in xgb_multi_model_features. Any suggestions??
Is it the first, last, or some sort of average across all the targets
It's the average of all targets.