Tags: r, xgboost, r-ranger

Feature importance plot using xgb and also ranger. Best way to compare


I'm working on a script that trains both a ranger random forest and an xgboost regression. Depending on which performs best based on RMSE, one or the other is used to test against hold-out data.

I would also like to return feature importance for both in a comparable way.

With the xgboost library, I can get my feature importance table and plot like so:

> xgb.importance(model = regression_model)
                 Feature        Gain       Cover  Frequency
1:              spend_7d 0.981006272 0.982513621 0.79219969
2:                   IOS 0.006824499 0.011105014 0.08112324
3:  is_publisher_organic 0.006379284 0.002917203 0.06770671
4: is_publisher_facebook 0.005789945 0.003464162 0.05897036

Then I can plot it like so:

> xgb.importance(model = regression_model) %>% xgb.plot.importance()

[plot: xgb.plot.importance() bar chart of the features above]

That was using the xgboost library and its functions. With a ranger random forest, if I fit a regression model, I can get feature importance by including importance = 'impurity' when fitting the model. Then:

regression_model$variable.importance
             spend_7d        d7_utility_sum  recent_utility_ratio                   IOS  is_publisher_organic is_publisher_facebook 
         437951687132                     0                     0             775177421             600401959            1306174807 

I could just create a ggplot from that table. But the scales are entirely different between what ranger returns and what xgb shows in the plot.

Is there an out-of-the-box library or solution that plots the feature importance of either the xgb or ranger model in a comparable way?


Solution

  • Both the "Gain" column of XGBoost and the importances of ranger with importance = "impurity" are built from the total decrease in impurity (hence gain) over the splits on a given variable.

    The only difference appears to be that XGBoost automatically normalizes its importances to percentages, while ranger keeps them as raw values (sums of squared-error decreases), which are not very handy to plot. You can therefore transform ranger's importances by dividing them by their total sum, which gives you the equivalent percentages as in XGBoost.
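    A quick sketch of that normalization (in Python purely for illustration, using the raw values from the ranger output in the question):

    ```python
    import numpy as np

    # Raw impurity importances, in the order printed by ranger above
    raw = np.array([437951687132, 0, 0, 775177421, 600401959, 1306174807],
                   dtype=float)

    # Divide by the total so the values sum to 1, matching XGBoost's "Gain" scale
    pct = raw / raw.sum()
    print(pct.round(4))  # spend_7d dominates, just as in the xgb.importance table
    ```

    In R the same thing is simply `regression_model$variable.importance / sum(regression_model$variable.importance)`.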

    Since impurity decrease can sometimes be misleading, I instead suggest you compute (for both models) the variable importances via permutation. This gives you importances that are directly comparable across different models, and it is more stable.

    I suggest this incredibly helpful post

    Here is the permutation importance, as defined in there (sorry it's Python, not R):

    import numpy as np

    def permutation_importances(rf, X_train, y_train, metric):
        # Baseline score on the unshuffled data
        baseline = metric(rf, X_train, y_train)
        imp = []
        for col in X_train.columns:
            save = X_train[col].copy()
            # Shuffle one column; the drop in score is that feature's importance
            X_train[col] = np.random.permutation(X_train[col])
            m = metric(rf, X_train, y_train)
            X_train[col] = save  # restore the original column
            imp.append(baseline - m)
        return np.array(imp)
    

    However, ranger can also compute permutation importances directly via importance = "permutation", and xgboost may offer something similar.
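
    If you end up doing the comparison from Python, scikit-learn ships the same idea ready-made in sklearn.inspection.permutation_importance. A sketch on toy data (not the models from the question):

    ```python
    import numpy as np
    from sklearn.inspection import permutation_importance
    from sklearn.linear_model import LinearRegression

    # Toy regression data: y depends almost entirely on the first feature
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 3))
    y = 3 * X[:, 0] + rng.normal(scale=0.1, size=200)

    model = LinearRegression().fit(X, y)

    # n_repeats shuffles each column several times and averages the score drop
    result = permutation_importance(model, X, y, n_repeats=10, random_state=0)
    print(result.importances_mean)  # first feature dominates
    ```

    Because the result is "drop in score per feature", the same call works for any fitted estimator, which is exactly what makes permutation importances comparable across model types.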