r quanteda k-fold

k-fold cross validation in quanteda


I've been using the quanteda supervised machine learning (SML) workflow described in the quanteda tutorial (https://tutorials.quanteda.io/machine-learning/nb/) and found it extremely helpful for setting up my own classification task. However, instead of the fixed held-out train/test split, I would like to use k-fold cross-validation. Could you point me towards the best way to integrate it into the workflow? Is there an easy way to apply it in quanteda?

Many thanks

I tried to add a cross validation based on this example: https://rdrr.io/github/quanteda/quanteda.classifiers/man/crossval.html

require(quanteda)
require(quanteda.textmodels)
require(caret)


corp_movies <- data_corpus_moviereviews
summary(corp_movies, 5)
# generate 1500 numbers without replacement
set.seed(300)
id_train <- sample(1:2000, 1500, replace = FALSE)
head(id_train, 10)
# create docvar with ID
corp_movies$id_numeric <- 1:ndoc(corp_movies)

# tokenize texts
toks_movies <- tokens(corp_movies, remove_punct = TRUE, remove_numbers = TRUE) %>% 
  tokens_remove(pattern = stopwords("en")) %>% 
  tokens_wordstem()
dfmt_movie <- dfm(toks_movies)

# get training set
dfmat_training <- dfm_subset(dfmt_movie, id_numeric %in% id_train)

# get test set (documents not in id_train)
dfmat_test <- dfm_subset(dfmt_movie, !id_numeric %in% id_train)

tmod_nb <- textmodel_nb(dfmat_training, dfmat_training$sentiment)
summary(tmod_nb)

dfmat_matched <- dfm_match(dfmat_test, features = featnames(dfmat_training))

actual_class <- dfmat_matched$sentiment
predicted_class <- predict(tmod_nb, newdata = dfmat_matched)
tab_class <- table(actual_class, predicted_class)
tab_class

# confusionMatrix() is a function in caret (loaded above), not a separate package
confusionMatrix(tab_class, mode = "everything", positive = "pos")

# k-fold cross-validation
require(crossval)
dfmat <- dfm(toks_movies)
tmod <- textmodel_nb(dfmat, y = data_corpus_moviereviews$sentiment)
crossval(tmod, k = 5, by_class = TRUE)
crossval(tmod, k = 5, by_class = FALSE)
crossval(tmod, k = 5, by_class = FALSE, verbose = TRUE)

But it returns the error "Error in group.samples(Y) : argument "Y" is missing, with no default".


Solution

  • This should probably be a comment, but I cannot post comments yet. I think your problem is caused by calling crossval() from the wrong package. The link you shared documents crossval() from quanteda.classifiers, which is installed from GitHub (quanteda/quanteda.classifiers), whereas your code loads the unrelated crossval package with require(crossval). That package's crossval() is defined differently and presumably expects a different pipeline: it requires additional X and Y arguments, and their absence is the reason for your error.
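    As a rough sketch of both routes (not a tested drop-in for your pipeline): the first part calls crossval() with an explicit quanteda.classifiers:: prefix, so the function from the crossval package can no longer shadow it, and only runs if that GitHub package is installed. The second part is a hand-written 5-fold loop that needs only quanteda and quanteda.textmodels. The names k, fold and accuracy are my own, the seed is arbitrary, and I skipped stopword removal and stemming to keep it short.

```r
library(quanteda)
library(quanteda.textmodels)

# Build one dfm for the whole corpus (preprocessing shortened for brevity)
toks <- tokens(data_corpus_moviereviews, remove_punct = TRUE, remove_numbers = TRUE)
dfmat <- dfm(toks)

# Route 1: crossval() from quanteda.classifiers, called with its namespace.
# Assumes the package was installed from GitHub, e.g. with
# remotes::install_github("quanteda/quanteda.classifiers")
if (requireNamespace("quanteda.classifiers", quietly = TRUE)) {
  tmod <- textmodel_nb(dfmat, y = dfmat$sentiment)
  print(quanteda.classifiers::crossval(tmod, k = 5, by_class = FALSE))
}

# Route 2: manual k-fold cross-validation
k <- 5
set.seed(300)
# Randomly assign every document to one of k equally sized folds
fold <- sample(rep(seq_len(k), length.out = ndoc(dfmat)))

accuracy <- vapply(seq_len(k), function(i) {
  dfmat_train <- dfmat[fold != i, ]  # k - 1 folds for training
  dfmat_test  <- dfmat[fold == i, ]  # the remaining fold for testing
  tmod_i <- textmodel_nb(dfmat_train, y = dfmat_train$sentiment)
  # Align test features with the training features before predicting
  dfmat_matched <- dfm_match(dfmat_test, features = featnames(dfmat_train))
  predicted <- predict(tmod_i, newdata = dfmat_matched)
  mean(as.character(predicted) == as.character(dfmat_matched$sentiment))
}, numeric(1))

accuracy        # accuracy per fold
mean(accuracy)  # cross-validated accuracy
```

    If you want precision, recall and F1 per fold rather than plain accuracy, you could build the table(actual, predicted) inside the loop and pass it to caret's confusionMatrix(), as in your held-out example.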