
Term frequencies from VCorpus and DTM do not match


I calculated term frequencies for a set of test documents in two ways: directly from the corpus, and from a document-term matrix (DTM), as shown below. The two sets of counts do not match. Can anyone tell me where the mismatch comes from? Am I using the wrong methods to extract term frequencies?

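library("tm")
library("stringr")
library("dplyr")
# Read every file in the directory into a corpus
test1 <- VCorpus(DirSource("test_papers"))
# Method 1: split each document on word boundaries and tabulate the words
mytable1 <- lapply(test1, function(x) str_extract_all(x, boundary("word"))) %>%
  unlist() %>% table() %>% sort(decreasing = TRUE)
# Method 2: build a document-term matrix and sum each term's column
test2 <- DocumentTermMatrix(test1)
mytable2 <- apply(test2, 2, sum) %>% sort(decreasing = TRUE)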
head(mytable1)
.
and  of the  to  in  on 
148 116 111  69  61  54 
head(mytable2)
      and       the      this      that       are political 
      145       120        35        34        33        33 

Solution

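  • The two methods tokenize the text differently.

    str_extract_all with boundary("word") strips punctuation from the sentences; DocumentTermMatrix with its default settings does not. To bring the counts closer together, build the matrix with DocumentTermMatrix(test1, control = list(removePunctuation = TRUE)).

    Detailed explanation:

    In the first case, "this is a text." returns the four words without the period. In the second case, the document term matrix contains the term with the period attached ("text."). So if the text reads "text and text.", the first method counts "text" = 2, while the document term matrix counts "text" = 1 and "text." = 1.

    With removePunctuation = TRUE the period is removed and both methods count "text" = 2.

    Two other DocumentTermMatrix defaults also matter here: terms are converted to lower case, and terms shorter than three characters are dropped (wordLengths = c(3, Inf)). That is why "of", "to", "in" and "on" are missing from mytable2, and probably why "the" has a higher count there. Pass tolower = FALSE and wordLengths = c(1, Inf) in the control list if you want to keep case and short words.

    You might also want to remove numbers first, because removePunctuation strips the decimal points and commas from numbers.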
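
To make the comparison concrete, here is a minimal sketch that runs both counting methods on two made-up sentences instead of the files in "test_papers". The sample text and the names docs, corp, freq_words, dtm_default, dtm_matched and freq_dtm are assumptions for illustration only. It assumes a recent version of tm, and real documents with hyphens, apostrophes or numbers may still produce small differences.

```r
library(tm)
library(stringr)

# Hypothetical in-memory documents standing in for the files in "test_papers"
docs <- c("Text and text.", "The text is on the table, and the end.")
corp <- VCorpus(VectorSource(docs))

# Method 1: word-boundary extraction, as in the question
# (case-sensitive, keeps short words, drops punctuation)
words <- unlist(lapply(corp, function(d) {
  str_extract_all(as.character(d), boundary("word"))
}))
freq_words <- table(words)

# Method 2a: DTM with default settings
# (lower-cases, drops terms shorter than 3 characters, keeps punctuation)
dtm_default <- DocumentTermMatrix(corp)
Terms(dtm_default)

# Method 2b: DTM with control options chosen to mimic method 1
dtm_matched <- DocumentTermMatrix(corp, control = list(
  removePunctuation = TRUE,   # "text." becomes "text"
  tolower = FALSE,            # keep "The" and "the" separate
  wordLengths = c(1, Inf)     # keep short words such as "is" and "on"
))
freq_dtm <- colSums(as.matrix(dtm_matched))

# Compare the two frequency tables term by term
setequal(names(freq_words), names(freq_dtm))
all(freq_dtm[names(freq_words)] == as.vector(freq_words))
```

Inspecting Terms(dtm_default) next to names(freq_words) is a quick way to see which default (punctuation, case, or word length) accounts for a given mismatch in your own corpus.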