Tags: r, random-forest, sparklyr

How to get feature importance of the best model from cross validator in sparklyr?


I'm able to train a random forest cross validator in sparklyr, but I cannot find a way to get the feature importance for the best model.

If I train a simple random forest model, I can use:

fit <- ml_random_forest(...)
feature_imp <- ml_tree_feature_importance(fit)

However, if I do the same thing with the best model from the cross validator, I get an error:

> cv_model <- ml_fit(cv, df_training)
> feature_imp <- ml_tree_feature_importance(cv_model$best_model)
Error in UseMethod("ml_feature_importances") : 
no applicable method for 'ml_feature_importances' applied to an object of class "c('ml_pipeline_model', 'ml_transformer', 'ml_pipeline_stage')"

Is there a way to get feature importance for the best model from cross validator?

The key questions here are:

  1. What is the difference between the output of ml_fit and the output of ml_random_forest?
  2. What functions can be applied to one, and what can be applied to the other?
  3. Can they be converted to each other?
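A quick way to see the difference is to inspect the classes of the two objects. The sketch below assumes `df_training` and a configured cross validator `cv` already exist; the exact class names printed may vary by sparklyr version, but the pipeline-model class is the one reported in the error above:

```r
library(sparklyr)

# A standalone fit returns an ml_model wrapper, which
# ml_tree_feature_importance() / ml_feature_importances() accepts.
fit <- ml_random_forest(df_training, label ~ .)
class(fit)

# The best model from a cross validator is a fitted pipeline
# (class "ml_pipeline_model"), a transformer with no
# ml_feature_importances() method -- hence the error above.
cv_model <- ml_fit(cv, df_training)
class(cv_model$best_model)

# The individual stages of the pipeline model can be listed with:
ml_stages(cv_model$best_model)
```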

Solution

  • I looked closely into the structure of the best model from a cross validator.

    For a tree-based model (I checked GBT and RF), the algorithm stage contains a component called feature_importances that holds the values for all actual model features (these differ from the variable names given by the feature assembler stage, where one-hot variables are not expanded).

    Unfortunately, this feature_importances vector is unnamed, so I had to figure out the corresponding variable name for each value.

    My approach is this: from the feature assembler we can get a collapsed vector of column names in which one-hot-encoded variables are not expanded; then, for each one-hot-encoded variable, we replace it with the set of variable names for its levels, eventually building the full column-name vector. I assume the order of variables is the same as that given by the feature assembler.

    To get the levels of the one-hot variables, we can go back to the stages whose uid contains one_hot_encoder_ to identify the one-hot-encoded variables, then to the stages whose uid contains string_indexer_ to get the levels (stored in a sublist named labels) for each one-hot-encoded variable. Note that since this is in essence dummy encoding, one of the levels serves as the reference level and will not appear as a separate variable; I assume the first level recorded in labels is the reference level, and that the order of the expanded variables for a given one-hot-encoded variable matches the order given in labels.

    Under these three assumptions, I'm able to reconstruct the column-name vector and attach it to the feature importance vector, forming a feature importance table like the one ml_feature_importances() returns for a GBT or RF model trained without a cross validator.
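The whole reconstruction can be sketched roughly as follows. This is an untested sketch against a hypothetical pipeline: the position of the tree stage, the `_indexed_encoded` column-name suffix, and the exact list structure of each stage (`$uid`, `$feature_importances`, `$param_map`, `$labels`) are assumptions that may differ across sparklyr/Spark versions:

```r
library(sparklyr)

# cv_model comes from ml_fit(cv, df_training); its best_model is an
# ml_pipeline_model whose stages we can inspect directly.
stages <- ml_stages(cv_model$best_model)
uids   <- vapply(stages, function(s) s$uid, character(1))

# 1. Unnamed importance vector from the tree (RF/GBT) stage -- assumed
#    here to be the last stage of the pipeline.
imp <- stages[[length(stages)]]$feature_importances

# 2. Collapsed column names from the feature assembler stage
#    (one-hot variables not yet expanded).
assembler <- stages[[grep("vector_assembler_", uids)[1]]]
base_cols <- unlist(assembler$param_map$input_cols)

# 3. Levels of each one-hot variable, recovered from the matching
#    string indexer stage (first label taken as the reference level).
indexers  <- stages[grep("string_indexer_", uids)]
levels_of <- lapply(indexers, function(s) s$labels)
names(levels_of) <- vapply(indexers,
                           function(s) s$param_map$input_col, character(1))

# 4. Expand collapsed names: replace each one-hot column with one name
#    per non-reference level, keeping the assembler's ordering.
full_cols <- unlist(lapply(base_cols, function(col) {
  # assumed naming convention for encoded columns: "<var>_indexed_encoded"
  var <- sub("_indexed_encoded$", "", col)
  if (var %in% names(levels_of)) {
    paste(var, levels_of[[var]][-1], sep = "_")  # drop reference level
  } else {
    col
  }
}))

# Attach the reconstructed names to the importance vector.
feature_imp <- data.frame(feature = full_cols, importance = imp)
feature_imp[order(-feature_imp$importance), ]
```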