I'm attempting to create a term document matrix with a text file that is about 3+ million lines of text. I have created a random sample of the text, which results in about 300,000 lines.
Unfortunately, when I use the following code I end up with 300,000 documents. I just want one document with the frequencies for each bigram:
library(RWeka)
library(tm)
corpus <- readLines("myfile")
numberLinesCorpus <- 3000000
corpus_sample <- corpus[sample(1:numberLinesCorpus, numberLinesCorpus*.1, replace = FALSE)]
myCorpus <- Corpus(VectorSource(corpus_sample))
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 2))
tdm <- TermDocumentMatrix(myCorpus, control = list(tokenize = BigramTokenizer))
The sample contains approximately 300,000 lines. However, the number of documents in tdm is also 300,000.
Any help would be much appreciated.
You'll need to use the paste function on your corpus_sample vector.
paste, with a value set for collapse, takes a vector with many text elements and converts it to a vector with one text element, where the original elements are separated by the string you specify.
text <- c('a', 'b', 'c')
text <- paste(text, collapse = " ")
text
# [1] "a b c"
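Applied to your case, a minimal sketch (keeping your variable names, and assuming corpus_sample is the character vector from your sampling step) would be:

# collapse the sampled lines into one long string, so the corpus has a single document
corpus_sample <- paste(corpus_sample, collapse = " ")
myCorpus <- Corpus(VectorSource(corpus_sample))
tdm <- TermDocumentMatrix(myCorpus, control = list(tokenize = BigramTokenizer))

Since VectorSource creates one document per element of the vector, collapsing to a single element before building the corpus is what gives you one document with the combined bigram frequencies.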