r apache-spark apache-spark-mllib apache-spark-ml sparklyr

How to extract feature importances in sparklyr?

Consider this simple example:

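library(sparklyr)
library(dplyr)

# Assumes a local Spark installation; adjust `master` for your cluster
sc <- spark_connect(master = "local")

# tibble() replaces the deprecated data_frame()
dtrain <- tibble(text = c("Chinese Beijing Chinese",
                          "Chinese Chinese Shanghai",
                          "Chinese Macao",
                          "Tokyo Japan Chinese"),
                 doc_id = 1:4,
                 class = c(1, 1, 1, 0))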

dtrain_spark <- copy_to(sc, dtrain, overwrite = TRUE)

> dtrain_spark
# Source:   table<dtrain> [?? x 3]
# Database: spark_connection
  text                     doc_id class
  <chr>                     <int> <dbl>
1 Chinese Beijing Chinese       1     1
2 Chinese Chinese Shanghai      2     1
3 Chinese Macao                 3     1
4 Tokyo Japan Chinese           4     0

I can easily train a decision tree classifier with the following pipeline:

pipeline <- ml_pipeline(
  ft_tokenizer(sc, input_col = "text", output_col = "tokens"),
  ft_count_vectorizer(sc, input_col = 'tokens', output_col = 'myvocab'),
  ml_decision_tree_classifier(sc, label_col = "class", 
                 features_col = "myvocab", 
                 prediction_col = "pcol",
                 probability_col = "prcol", 
                 raw_prediction_col = "rpcol")
)

model <- ml_fit(pipeline, dtrain_spark)

Now the issue is that I cannot extract the feature_importances in a meaningful way.

Running the following returns only an unlabeled numeric vector:

> ml_stage(model, 'decision_tree_classifier')$feature_importances
[1] 0 0 1 0 0 0

But what I want is the tokens! In my real-life example I have thousands of them, and as shown it is hard to understand anything.

Is there any way to back out the tokens from the matrix representation above?

Thanks!

Solution

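You can easily combine the CountVectorizerModel vocabulary with feature_importances. The feature vector produced by the count vectorizer is indexed in vocabulary order, so the i-th importance belongs to the i-th token: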

tibble(
  token = unlist(ml_stage(model, 'count_vectorizer')$vocabulary),
  importance = ml_stage(model, 'decision_tree_classifier')$feature_importances
)

# A tibble: 6 x 2
  token    importance
  <chr>         <dbl>
1 chinese           0
2 japan             1
3 shanghai          0
4 beijing           0
5 tokyo             0
6 macao             0
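
With thousands of tokens, most importances will be zero, so it may help to sort the combined table and keep only the top entries. Below is a minimal sketch in base R; rank_tokens is a hypothetical helper (not part of sparklyr), and the vocabulary and importance values are simply copied from the output above rather than pulled from a live Spark session.

```r
# Hypothetical helper: pair each token with its importance, sort by
# descending importance, and keep the first `top_n` rows.
rank_tokens <- function(vocabulary, importances, top_n = 10) {
  tokens <- unlist(vocabulary)
  # Both vectors must describe the same feature positions
  stopifnot(length(tokens) == length(importances))
  out <- data.frame(token = tokens,
                    importance = importances,
                    stringsAsFactors = FALSE)
  out <- out[order(-out$importance), , drop = FALSE]
  rownames(out) <- NULL
  head(out, top_n)
}

# Values copied from the tibble shown above (assumption: in practice these
# come from ml_stage(model, ...)$vocabulary and $feature_importances)
vocab <- list("chinese", "japan", "shanghai", "beijing", "tokyo", "macao")
imp   <- c(0, 1, 0, 0, 0, 0)

top <- rank_tokens(vocab, imp, top_n = 3)
print(top)
```

In a real session you would pass ml_stage(model, 'count_vectorizer')$vocabulary and ml_stage(model, 'decision_tree_classifier')$feature_importances as the two arguments.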