Consider this simple example:
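library(sparklyr)
library(dplyr)
# sc is assumed to be an existing Spark connection, e.g. spark_connect(master = "local")
dtrain <- tibble(text = c("Chinese Beijing Chinese",
                          "Chinese Chinese Shanghai",
                          "Chinese Macao",
                          "Tokyo Japan Chinese"),
                 doc_id = 1:4,
                 class = c(1, 1, 1, 0))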
dtrain_spark <- copy_to(sc, dtrain, overwrite = TRUE)
> dtrain_spark
# Source: table<dtrain> [?? x 3]
# Database: spark_connection
  text                     doc_id class
  <chr>                     <int> <dbl>
1 Chinese Beijing Chinese       1     1
2 Chinese Chinese Shanghai      2     1
3 Chinese Macao                 3     1
4 Tokyo Japan Chinese           4     0
I can easily train a decision_tree_classifier with the following pipeline:
pipeline <- ml_pipeline(
  ft_tokenizer(sc, input_col = "text", output_col = "tokens"),
  ft_count_vectorizer(sc, input_col = 'tokens', output_col = 'myvocab'),
  ml_decision_tree_classifier(sc, label_col = "class",
                              features_col = "myvocab",
                              prediction_col = "pcol",
                              probability_col = "prcol",
                              raw_prediction_col = "rpcol")
)
model <- ml_fit(pipeline, dtrain_spark)
The issue is that I cannot extract feature_importances in a meaningful way.
Running the following returns only a bare numeric vector:
> ml_stage(model, 'decision_tree_classifier')$feature_importances
[1] 0 0 1 0 0 0
But what I want are the tokens! In my real-life example I have thousands of them, and as shown above it is hard to understand anything from the raw vector.
Is there any way to back out the tokens from the matrix representation above?
Thanks!
You can easily combine the CountVectorizerModel vocabulary with feature_importances. The two are aligned by position: the i-th importance belongs to the i-th vocabulary entry.
tibble(
  token = unlist(ml_stage(model, 'count_vectorizer')$vocabulary),
  importance = ml_stage(model, 'decision_tree_classifier')$feature_importances
)
# A tibble: 6 x 2
  token    importance
  <chr>         <dbl>
1 chinese           0
2 japan             1
3 shanghai          0
4 beijing           0
5 tokyo             0
6 macao             0
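
With thousands of tokens, most importances will probably be zero, so it may help to drop those rows and sort what remains. The following is only a sketch: rank_tokens is a hypothetical helper (not part of sparklyr), and it works on plain vectors in base R so it can be tried without a Spark connection. The example values are copied from the output above.

```r
# Hypothetical helper: pair each token with its importance, drop tokens with
# zero importance, and sort from most to least important.
rank_tokens <- function(vocabulary, importances) {
  vocabulary <- unlist(vocabulary)
  # The vectors are matched by position, so their lengths must agree.
  stopifnot(length(vocabulary) == length(importances))
  out <- data.frame(token = vocabulary,
                    importance = importances,
                    stringsAsFactors = FALSE)
  out <- out[out$importance > 0, , drop = FALSE]
  out <- out[order(-out$importance), , drop = FALSE]
  rownames(out) <- NULL
  out
}

# Values copied from the tibble above. With a fitted model they would come from
#   ml_stage(model, 'count_vectorizer')$vocabulary
#   ml_stage(model, 'decision_tree_classifier')$feature_importances
vocab <- c("chinese", "japan", "shanghai", "beijing", "tokyo", "macao")
imp <- c(0, 1, 0, 0, 0, 0)

ranked <- rank_tokens(vocab, imp)
print(ranked)
```

For this toy data only the japan row survives the filter. The length check is there because a mismatch between the vocabulary and the importance vector would otherwise pair tokens with the wrong values.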