I calculated the term frequencies of my test documents both from the corpus and from the DTM, as shown below, but the two results don't match. Can anyone tell me where the mismatch comes from? Is it because I used the wrong methods to extract the term frequencies?
library("tm")
library("stringr")
library("dplyr")
test1 <- VCorpus(DirSource("test_papers"))
mytable1 <- lapply(test1, function(x){str_extract_all(x, boundary("word"))}) %>% unlist() %>% table() %>% sort(decreasing=T)
test2 <- DocumentTermMatrix(test1)
mytable2 <- apply(test2, 2, sum) %>% sort(decreasing=T)
head(mytable1)
.
and of the to in on
148 116 111 69 61 54
head(mytable2)
and the this that are political
145 120 35 34 33 33
Difference in methods used.
str_extract_all with boundary("word") removes the punctuation in the sentences; turning the text into a document term matrix doesn't. To get the same numbers you need to use DocumentTermMatrix(test1, control = list(removePunctuation = TRUE)).
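As a rough sketch (reusing the test1 corpus from the question; exact counts depend on your documents), the control argument would look like this. Note that DocumentTermMatrix also lowercases terms and by default drops terms shorter than three characters (wordLengths = c(3, Inf)), so very short words such as "of" or "to" still won't appear unless you relax that too:

library("tm")
library("dplyr")

# Strip punctuation from the tokens and keep words of any length,
# so the counts line up better with the str_extract_all table
test2_clean <- DocumentTermMatrix(
  test1,
  control = list(removePunctuation = TRUE, wordLengths = c(1, Inf))
)
mytable2_clean <- apply(test2_clean, 2, sum) %>% sort(decreasing = TRUE)
head(mytable2_clean)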
Detailed explanation:
In the first case, "this is a text." would return the four words without the period. In the second case you would get "text." with the period as a term in the document term matrix. Now if the text reads "text and text.", the first case would count "text" = 2, while the document term matrix would count "text" = 1 and "text." = 1.
Using removePunctuation will remove the period and the counts will be equal.
You might also want to remove numbers first as well, because removePunctuation strips the points and commas out of numbers.
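A minimal sketch of that difference on a toy string (the corp and x names here are only illustrative, not from the question):

library("tm")
library("stringr")

x <- "text and text."

# stringr's word boundaries drop the trailing period: "text" "and" "text"
str_extract_all(x, boundary("word"))[[1]]

# A one-document corpus keeps "text." as its own term by default ...
corp <- VCorpus(VectorSource(x))
inspect(DocumentTermMatrix(corp))

# ... while removePunctuation merges it back into "text" (count 2)
inspect(DocumentTermMatrix(corp, control = list(removePunctuation = TRUE)))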