Tags: python, scikit-learn, prediction, xgboost, feature-selection

What is the purpose of the Booster object in XGBoost, and how can it be used with SelectFromModel in scikit-learn?



I am trying to use XGBoost for prediction by extracting the important features and then using them to predict the target values. I have tried two pieces of code, one that goes through the core training API (which returns a Booster) and one that uses the scikit-learn wrapper, and the feature importances differ between the two.
xg_reg = xgb.XGBRegressor(objective='reg:squarederror', colsample_bytree=0.3,
                          learning_rate=0.01, max_depth=6, reg_alpha=15,
                          n_estimators=1000, subsample=0.5)

xg_reg_1 = xgb.train(params=params, dtrain=data_dmatrix, num_boost_round=300)
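
For completeness: params and data_dmatrix in the second snippet are not defined above. An assumed setup along the following lines (the exact values are a guess mirroring the wrapper's hyperparameters) would make it runnable:

# Assumed setup (not shown above): wrap the training data in a DMatrix and
# mirror the wrapper's hyperparameters in a params dict for the core API.
data_dmatrix = xgb.DMatrix(data=X, label=y)
params = {'objective': 'reg:squarederror', 'colsample_bytree': 0.3,
          'learning_rate': 0.01, 'max_depth': 6, 'alpha': 15,
          'subsample': 0.5}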

Also, if I use the Booster object in SelectFromModel, it throws an error. Kindly let me know what changes need to be made to the code.

xgb_fea_imp = pd.DataFrame(list(xg_reg_1.get_fscore().items()),
                           columns=['feature', 'importance']).sort_values('importance', ascending=False)
threshold1 = xgb_fea_imp.T.to_numpy()

from sklearn.feature_selection import SelectFromModel    
# select the features
selection = SelectFromModel(xg_reg_1, threshold=threshold1[5], prefit=True)
feature_idx = selection.get_support()
feature_name = X.columns[feature_idx]

selected_dataset = selection.transform(X)
selected_dataset = pd.DataFrame(selected_dataset)
selected_dataset.columns = feature_name

The error is as follows:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-12-b089dd085f01> in <module>
      4 selection = SelectFromModel(xg_reg_1, threshold=threshold1[5], prefit=True)
      5 
----> 6 feature_idx = selection.get_support()
      7 feature_name = X.columns[feature_idx]
      8 #print(feature_idx)

~\Anaconda3\lib\site-packages\sklearn\feature_selection\_base.py in get_support(self, indices)
     50             values are indices into the input feature vector.
     51         """
---> 52         mask = self._get_support_mask()
     53         return mask if not indices else np.where(mask)[0]
     54 

~\Anaconda3\lib\site-packages\sklearn\feature_selection\_from_model.py in _get_support_mask(self)
    186                              ' "prefit=True" while passing the fitted'
    187                              ' estimator to the constructor.')
--> 188         scores = _get_feature_importances(
    189             estimator=estimator, getter=self.importance_getter,
    190             transform_func='norm', norm_order=self.norm_order)

~\Anaconda3\lib\site-packages\sklearn\feature_selection\_base.py in _get_feature_importances(estimator, getter, transform_func, norm_order)
    171                 getter = attrgetter('feature_importances_')
    172             else:
--> 173                 raise ValueError(
    174                     f"when `importance_getter=='auto'`, the underlying "
    175                     f"estimator {estimator.__class__.__name__} should have "

ValueError: when `importance_getter=='auto'`, the underlying estimator Booster should have `coef_` or `feature_importances_` attribute. Either pass a fitted estimator to feature selector or call fit before calling transform.

If I then move ahead and set prefit=False, it asks me to fit the model before using it.


Solution

  • You shouldn't build the XGBoost regression model with the core API if you want to plug it into scikit-learn utilities. xgb.train returns a Booster object, which has neither a coef_ nor a feature_importances_ attribute, which is exactly what the ValueError above complains about. Use xgb.XGBRegressor instead: it is scikit-learn compatible and exposes feature_importances_ after fitting, so it can be passed to SelectFromModel. A minimal sketch follows below.
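
Here is a minimal sketch of the scikit-learn-compatible approach. It assumes X is a pandas DataFrame and y the matching target (neither is shown in the question), reuses the hyperparameters from the original XGBRegressor call, and mirrors the question's threshold1[5] by thresholding at the sixth-largest importance (which assumes X has more than six columns):

import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.feature_selection import SelectFromModel

xg_reg = xgb.XGBRegressor(objective='reg:squarederror', colsample_bytree=0.3,
                          learning_rate=0.01, max_depth=6, reg_alpha=15,
                          n_estimators=1000, subsample=0.5)
xg_reg.fit(X, y)  # fitting populates feature_importances_

# Keep every feature at least as important as the sixth-largest importance,
# mirroring the question's threshold1[5] (assumes X has more than 6 columns).
importances = np.sort(xg_reg.feature_importances_)[::-1]
selection = SelectFromModel(xg_reg, threshold=importances[5], prefit=True)

feature_idx = selection.get_support()  # boolean mask over the columns
feature_name = X.columns[feature_idx]
selected_dataset = pd.DataFrame(selection.transform(X), columns=feature_name)

If the core API is still needed elsewhere, the Booster trained inside the wrapper can be retrieved after fitting with xg_reg.get_booster(). As for the differing importances in the question: the two snippets use different hyperparameters and numbers of boosting rounds, and get_fscore() reports 'weight' importance while the wrapper's feature_importances_ may use a different importance_type (e.g. 'gain', depending on the xgboost version), either of which can change the ranking.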