I am training a simple text classification model on 1,000 training examples and would like to make predictions on unseen test data (about 500,000 observations).
The script works fine as long as I use only unigrams. However, I am not sure how to use control = list(dictionary = Terms(dtm_train_unigram)) when working with unigrams and bigrams, since I have two separate document-term matrices (one for unigrams, one for bigrams, see below):
# unigram tokenizer and unigram training DTM
UnigramTokenizer <- function(x) unlist(lapply(NLP::ngrams(words(x), 1), paste, collapse = " "), use.names = FALSE)
dtm_train_unigram <- DocumentTermMatrix(processed_dataset,
                                        control = list(tokenize = UnigramTokenizer,
                                                       wordLengths = c(3, 20),
                                                       bounds = list(global = c(4, Inf))))

# bigram tokenizer and bigram training DTM
BigramTokenizer <- function(x) unlist(lapply(NLP::ngrams(words(x), 2), paste, collapse = " "), use.names = FALSE)
dtm_train_bigram <- DocumentTermMatrix(processed_dataset,
                                       control = list(tokenize = BigramTokenizer,
                                                      wordLengths = c(6, 20),
                                                      bounds = list(global = c(7, Inf))))
To ensure that the test set has the same terms as the training set, I use the following code:
corpus_test <- VCorpus(VectorSource(test_set))
dtm_test <- DocumentTermMatrix(corpus_test,
                               control = list(dictionary = Terms(dtm_train_unigram),
                                              wordLengths = c(3, 20)))
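To double-check that the dictionary works as intended, I verify that the test matrix ends up with exactly the training vocabulary (this check assumes the dtm_test built above):

identical(Terms(dtm_test), Terms(dtm_train_unigram))  # should be TRUE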
This leads to two questions:
1. How do I feed the terms of both the dtm_train_unigram and the dtm_train_bigram to the dtm_test?
2. Can I combine the dtm_train_unigram and the dtm_train_bigram into a single dtm after creating them separately (as currently done)?
Thank you!
Answering your questions:
The official documentation of tm states the following about combining objects:
Combine several corpora into a single one, combine multiple documents into a corpus, combine multiple term-document matrices into a single one, or combine multiple term frequency vectors into a single term-document matrix.
In your case, this would be the answer to question 1:
my_dtms <- c(dtm_train_unigram, dtm_train_bigram)
However, this doubles the number of documents in the result: each document appears once with its unigram terms and once with its bigram terms, which does not reflect your actual corpus.
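You can see the doubling directly with nDocs(); a quick check, with counts assuming your 1,000 training documents:

nDocs(dtm_train_unigram)                      # 1000
nDocs(dtm_train_bigram)                       # 1000
nDocs(c(dtm_train_unigram, dtm_train_bigram)) # 2000: every document appears twice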
So we come to point 2: with the NLP package you can create a single tokenizer that handles more than one n-gram length at once:
my_tokenizer <- function(x) unlist(lapply(NLP::ngrams(words(x), 1:2), paste, collapse = " "), use.names = FALSE)
Note the vector 1:2 passed to the ngrams function. Change it to 1:3 for unigrams, bigrams, and trigrams, or to 2:3 for just bigrams and trigrams.
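Putting it together, here is a minimal sketch of the full workflow. It reuses your object names (processed_dataset, test_set); the wordLengths and bounds values are simply carried over from your unigram call and may need tuning for a mixed unigram/bigram vocabulary:

library(tm)
library(NLP)

# one tokenizer that emits unigrams and bigrams in a single pass
my_tokenizer <- function(x) unlist(lapply(NLP::ngrams(words(x), 1:2), paste, collapse = " "), use.names = FALSE)

# a single training DTM holding unigram and bigram terms side by side
dtm_train <- DocumentTermMatrix(processed_dataset,
                                control = list(tokenize = my_tokenizer,
                                               wordLengths = c(3, 20),
                                               bounds = list(global = c(4, Inf))))

# build the test DTM with the same tokenizer, restricted to the training vocabulary
corpus_test <- VCorpus(VectorSource(test_set))
dtm_test <- DocumentTermMatrix(corpus_test,
                               control = list(tokenize = my_tokenizer,
                                              dictionary = Terms(dtm_train),
                                              wordLengths = c(3, 20)))

Because Terms(dtm_train) now contains both unigram and bigram terms, this single dictionary answers question 1 as well, without the document doubling of the c() approach. Passing the same tokenizer to the test call matters: without it, the bigram terms in the dictionary would never be generated and would simply stay at zero.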