Tags: r, statistics, nlp, tm, term-document-matrix

Creating TermDocumentMatrix: issue with number of documents


I'm attempting to create a term-document matrix from a text file of roughly 3 million lines. I've created a random sample of the text, which comes to about 300,000 lines.

Unfortunately, when I use the following code I end up with 300,000 documents, when all I want is one document with the frequencies for each bigram:

library(RWeka)
library(tm)

# read the full file, then draw a 10% random sample of its lines
corpus <- readLines("myfile")
numberLinesCorpus <- 3000000
corpus_sample <- corpus[sample(1:numberLinesCorpus, numberLinesCorpus * .1, replace = FALSE)]

# build a corpus and a unigram/bigram term-document matrix
myCorpus <- Corpus(VectorSource(corpus_sample))
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 2))
tdm <- TermDocumentMatrix(myCorpus, control = list(tokenize = BigramTokenizer))

The sample contains approximately 300,000 lines, and the number of documents in tdm also comes out to 300,000 instead of 1.

Any help would be much appreciated.


Solution

  • You'll need to use the paste function on your corpus_sample vector. VectorSource treats each element of a character vector as a separate document, which is why you get one document per line; collapsing the vector into one string first gives you a single document (a sketch applying this to your code follows the example).

    paste, with a value supplied for collapse, takes a vector with many text elements and converts it into a vector with a single text element, in which the original elements are separated by the string you specify:

    text <- c('a', 'b', 'c')
    text <- paste(text, collapse = " ")
    text
    # [1] "a b c"
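
    Applied to your pipeline, that means collapsing corpus_sample into a single string before handing it to VectorSource; everything else can stay as it is. A minimal sketch, reusing the file name and sampling step from your question:

    library(RWeka)
    library(tm)

    corpus <- readLines("myfile")
    numberLinesCorpus <- 3000000
    corpus_sample <- corpus[sample(1:numberLinesCorpus, numberLinesCorpus * .1, replace = FALSE)]

    # collapse the ~300,000 sampled lines into one string so the
    # corpus contains exactly one document
    corpus_sample <- paste(corpus_sample, collapse = " ")

    myCorpus <- Corpus(VectorSource(corpus_sample))
    BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 2))
    tdm <- TermDocumentMatrix(myCorpus, control = list(tokenize = BigramTokenizer))

    With a single document, inspect(tdm) shows one column whose rows are the n-gram frequencies you're after.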