Identifying distinct keywords using a classifier with quanteda


I am new to quantitative text analysis, and I am trying to extract the keywords associated with a particular classification category from the output of a naive Bayes classifier. I am running the example below, which classifies movie reviews as either positive or negative. I want two vectors, one containing the keywords associated with the positive category and one containing those associated with the negative category. Am I right in saying I should be focusing on the 'Estimated Feature Scores' from the summary() output, and if so, how do I interpret these?

require(quanteda)
require(quanteda.textmodels)
require(caret)

corp_movies <- data_corpus_moviereviews
summary(corp_movies, 5)

# generate 1500 numbers without replacement
set.seed(300)
id_train <- sample(1:2000, 1500, replace = FALSE)
head(id_train, 10)

# create docvar with ID
corp_movies$id_numeric <- 1:ndoc(corp_movies)

# get training set
dfmat_training <- corpus_subset(corp_movies, id_numeric %in% id_train) %>%
  dfm(remove = stopwords("english"), stem = TRUE)

# get test set (documents not in id_train)
dfmat_test <- corpus_subset(corp_movies, !id_numeric %in% id_train) %>%
  dfm(remove = stopwords("english"), stem = TRUE)

tmod_nb <- textmodel_nb(dfmat_training, dfmat_training$sentiment)
summary(tmod_nb) 

Solution

  • If you just want to know the most negative and positive words, consider textstat_keyness() on a dfm created from the entire corpus, grouped into positive and negative reviews. This does not create two word vectors. It returns a single table of features, each with a keyness score indicating the strength of association with one category: with target = "pos", positive scores mark features over-represented in positive reviews, and negative scores mark features over-represented in negative reviews.

    library("quanteda", warn.conflicts = FALSE)
    ## Package version: 2.1.1
    ## Parallel computing: 2 of 12 threads used.
    ## See https://quanteda.io for tutorials and examples.
    data("data_corpus_moviereviews", package = "quanteda.textmodels")
    
    dfmat <- dfm(data_corpus_moviereviews,
      remove = stopwords("english"), stem = TRUE,
      groups = "sentiment"
    )
    
    tstat <- textstat_keyness(dfmat, target = "pos")
    textplot_keyness(tstat)
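
If you do need two separate vectors, one way is to split the keyness table by the sign of the score. The following is only a minimal sketch in base R, not part of quanteda: `split_keywords` and its `alpha` cutoff are hypothetical names, and it assumes the table has the columns `feature`, `chi2`, and `p`, as returned by textstat_keyness() with its default chi-squared measure. The toy data frame stands in for `tstat`, and its values are made up for illustration.

```r
# Split a keyness table into keywords for the target category (positive
# scores) and the reference category (negative scores).
# Assumes columns `feature`, `chi2`, and `p`; `alpha` is an illustrative
# significance cutoff, not a quanteda default.
split_keywords <- function(tstat, alpha = 0.05) {
  keep <- tstat$p < alpha
  pos <- tstat[keep & tstat$chi2 > 0, ]
  neg <- tstat[keep & tstat$chi2 < 0, ]
  list(
    pos = pos$feature[order(-pos$chi2)], # strongest positive first
    neg = neg$feature[order(neg$chi2)]   # strongest negative first
  )
}

# Toy stand-in for a keyness table (made-up values, placeholder features).
toy <- data.frame(
  feature = c("word_a", "word_b", "word_c", "word_d", "word_e", "word_f"),
  chi2 = c(25.1, 9.7, 0.4, -0.9, -7.2, -18.3),
  p = c(1e-6, 0.002, 0.53, 0.34, 0.007, 2e-5),
  stringsAsFactors = FALSE
)

keywords <- split_keywords(toy)
keywords$pos
keywords$neg
```

With the real results, `split_keywords(tstat)` would give the two vectors for the movie reviews. Whether to filter on the p-value, and at what threshold, is a judgment call; setting `alpha = 1` keeps every feature with a non-zero score.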