I'm working on a script that trains both a ranger random forest and an xgboost regression. Whichever performs better on RMSE is then used to score the hold-out data.
I would also like to return feature importance for both in a comparable way.
With the xgboost library, I can get my feature importance table and plot like so:
> xgb.importance(model = regression_model)
Feature Gain Cover Frequency
1: spend_7d 0.981006272 0.982513621 0.79219969
2: IOS 0.006824499 0.011105014 0.08112324
3: is_publisher_organic 0.006379284 0.002917203 0.06770671
4: is_publisher_facebook 0.005789945 0.003464162 0.05897036
Then I can plot it like so:
> xgb.importance(model = regression_model) %>% xgb.plot.importance()
That was using the xgboost library and its functions. With a ranger random forest, if I fit a regression model, I can get feature importance by including importance = 'impurity' while fitting the model. Then:
regression_model$variable.importance
spend_7d d7_utility_sum recent_utility_ratio IOS is_publisher_organic is_publisher_facebook
437951687132 0 0 775177421 600401959 1306174807
I could just create a ggplot, but the scales are entirely different between what ranger returns in that table and what xgb shows in its plot.
Is there an out-of-the-box library or solution that lets me plot the feature importance of either the xgb or the ranger model in a comparable way?
Both XGBoost's "Gain" column and ranger's importances with the parameter "impurity" are computed as the total decrease in impurity (hence, gain) over all splits on a given variable.
The only difference appears to be that XGBoost automatically normalizes the importances so they sum to 1, while ranger keeps them on their original scale (sums of squares for regression), which is not very handy to plot. You can therefore rescale the ranger importances by dividing them by their total sum, giving you percentages equivalent to XGBoost's.
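For instance, here is a minimal sketch of that rescaling with a ggplot to mimic xgb.plot.importance; the ranger model object name is a placeholder for your own fit:

library(ggplot2)

# `ranger_model` stands in for your ranger fit with importance = 'impurity'
imp <- ranger_model$variable.importance
imp_pct <- imp / sum(imp)  # now sums to 1, like XGBoost's Gain

imp_df <- data.frame(Feature = names(imp_pct), Gain = as.numeric(imp_pct))
ggplot(imp_df, aes(x = reorder(Feature, Gain), y = Gain)) +
  geom_col() +
  coord_flip() +
  labs(x = NULL, y = "Share of total impurity decrease")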
Since impurity decrease can sometimes be misleading, however, I suggest you compute the variable importances for both models via permutation instead. This gives you importances that are directly comparable across different models, and it is more stable.
I suggest this incredibly helpful post.
Here is permutation importance as defined there (sorry, it's Python, not R):
import numpy as np

def permutation_importances(rf, X_train, y_train, metric):
    # Score the fitted model on the unshuffled data
    baseline = metric(rf, X_train, y_train)
    imp = []
    for col in X_train.columns:
        save = X_train[col].copy()
        # Shuffle one column and re-score the model
        X_train[col] = np.random.permutation(X_train[col])
        m = metric(rf, X_train, y_train)
        X_train[col] = save  # restore the original column
        imp.append(baseline - m)  # drop in score = importance
    return np.array(imp)
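Since the rest of your pipeline is in R, here is a rough R translation of the same idea. The function and argument names are illustrative; metric is assumed to take (model, X, y) and return a score where higher is better (with an error metric such as RMSE, flip the subtraction):

permutation_importances <- function(model, X, y, metric) {
  baseline <- metric(model, X, y)
  imp <- numeric(ncol(X))
  names(imp) <- names(X)
  for (col in names(X)) {
    saved <- X[[col]]
    X[[col]] <- sample(saved)                    # permute one predictor
    imp[col] <- baseline - metric(model, X, y)   # drop in score = importance
    X[[col]] <- saved                            # restore the original values
  }
  imp
}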
However, ranger can also compute permutation importances directly if you pass importance = "permutation" when fitting the model, and xgboost might offer something similar.
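For ranger that is just an argument at fit time. A minimal sketch, with the formula and training data as placeholders for your own:

library(ranger)

# Refit with permutation importance instead of impurity;
# `y ~ .` and `train_df` stand in for your own formula and data
regression_model <- ranger(y ~ ., data = train_df, importance = "permutation")

sort(regression_model$variable.importance, decreasing = TRUE)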