Tags: r, statistics, nlp, tm, term-document-matrix

Creating TermDocumentMatrix: issue with number of documents


I'm attempting to create a term-document matrix from a text file of roughly 3 million lines. I've created a random sample of the text, which comes to about 300,000 lines.

Unfortunately, when I use the following code I end up with 300,000 documents, when all I want is one document with the frequencies for each bigram:

library(RWeka)
library(tm)

# read the full file, then draw a 10% random sample of its lines
corpus <- readLines("myfile")
numberLinesCorpus <- 3000000
corpus_sample <- corpus[sample(1:numberLinesCorpus, numberLinesCorpus * .1, replace = FALSE)]

# build a corpus and a unigram/bigram term-document matrix
myCorpus <- Corpus(VectorSource(corpus_sample))
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 2))
tdm <- TermDocumentMatrix(myCorpus, control = list(tokenize = BigramTokenizer))

The sample contains approximately 300,000 lines, and the number of documents in tdm also comes out to 300,000 instead of 1.

Any help would be much appreciated.


Solution

  • You'll need to use the paste function on your corpus_sample vector. VectorSource treats each element of a character vector as a separate document, which is why you get one document per line; collapsing the vector into one string first gives you a single document (a sketch applying this to your code follows the example).

    paste, with a value supplied for collapse, takes a vector with many text elements and converts it into a vector with a single text element, in which the original elements are separated by the string you specify:

    text <- c('a', 'b', 'c')
    text <- paste(text, collapse = " ")
    text
    # [1] "a b c"
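
    Applied to your pipeline, that means collapsing corpus_sample into a single string before handing it to VectorSource; everything else can stay as it is. A minimal sketch, reusing the file name and sampling step from your question:

    library(RWeka)
    library(tm)

    corpus <- readLines("myfile")
    numberLinesCorpus <- 3000000
    corpus_sample <- corpus[sample(1:numberLinesCorpus, numberLinesCorpus * .1, replace = FALSE)]

    # collapse the ~300,000 sampled lines into one string so the
    # corpus contains exactly one document
    corpus_sample <- paste(corpus_sample, collapse = " ")

    myCorpus <- Corpus(VectorSource(corpus_sample))
    BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 2))
    tdm <- TermDocumentMatrix(myCorpus, control = list(tokenize = BigramTokenizer))

    With a single document, inspect(tdm) shows one column whose rows are the n-gram frequencies you're after.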