The TermDocumentMatrix function of the tm package is not behaving the way I understand the documentation. It seems to process the terms in ways I have not requested.
Here is an example:
require(tm)
sentence <- "Astrology: I am a Capricorn Sun Cap moon and cap rising...what does that say about me?"
corpus <- Corpus(VectorSource(sentence))
tdm <- TermDocumentMatrix(corpus, control = list(wordLengths = c(1, Inf),
removePunctuation = FALSE))
rownames(tdm)
We can see from the output that the punctuation has been removed, and the expression "rising...what" has been split:
[1] "a" "about" "am" "and" "astrology" "cap" "capricorn" "does" "i" "me" "moon" "rising" "say" "sun" "that"
[16] "what"
In the related SO question, the issue was with the tokenizer, which was removing the punctuation. However, I am using the default words tokenizer, which I don't believe does this:
> sapply(corpus, words)
[,1]
[1,] "Astrology:"
[2,] "I"
[3,] "am"
[4,] "a"
[5,] "Capricorn"
[6,] "Sun"
[7,] "Cap"
[8,] "moon"
[9,] "and"
[10,] "cap"
[11,] "rising...what"
[12,] "does"
[13,] "that"
[14,] "say"
[15,] "about"
[16,] "me?"
Is the observed behaviour incorrect, or what is my misunderstanding?
You got a SimpleCorpus object, which came with tm package version 0.7 and which, according to ?SimpleCorpus,
takes internally various shortcuts to boost performance and minimize memory pressure
class(corpus)
# [1] "SimpleCorpus" "Corpus"
Now, as help(TermDocumentMatrix) states:
Available local options are documented in termFreq and are internally delegated to a termFreq call. This is different for a SimpleCorpus. In this case all options are processed in a fixed order in one pass to improve performance. It always uses the Boost Tokenizer (via Rcpp)...
So you are not using words as the tokenizer, which would indeed give you:
words(sentence)
[1] "Astrology:" "I" "am" "a" "Capricorn" "Sun" "Cap"
[8] "moon" "and" "cap" "rising...what" "does" "that" "say"
[15] "about" "me?"
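To make the difference concrete, the two tokenization styles can be approximated in base R. This is only a rough sketch: the regular expressions below are my own stand-ins, not the actual Boost tokenizer or the implementation of words, but for this sentence they reproduce the two results seen above.

```r
sentence <- "Astrology: I am a Capricorn Sun Cap moon and cap rising...what does that say about me?"

# Whitespace-only splitting: punctuation stays attached to the tokens,
# similar to what words() returned above.
ws_tokens <- strsplit(sentence, "[[:space:]]+")[[1]]

# Splitting on whitespace *and* punctuation: an assumed approximation of a
# tokenizer that treats punctuation as a delimiter.
punct_tokens <- strsplit(sentence, "[[:space:][:punct:]]+")[[1]]

# Lower-case and de-duplicate so that repeated words collapse into one term.
ws_terms <- unique(tolower(ws_tokens))
punct_terms <- unique(tolower(punct_tokens))

print(ws_terms)
print(punct_terms)
```

With the second pattern, "rising...what" falls apart into "rising" and "what", and "me?" loses its question mark, which matches the row names reported in the question.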
As stated in the comments, you could explicitly make your corpus a volatile corpus (see ?VCorpus) to regain full flexibility:
A volatile corpus is fully kept in memory and thus all changes only affect the corresponding R object
corpus <- VCorpus(VectorSource(sentence))
Terms(TermDocumentMatrix(corpus, control = list(tokenize = "words")))
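Combined with the options from the question, a possible end-to-end check looks like the following. It is a sketch that assumes tm 0.7 or later is installed; the variable names vcorpus and tdm are mine, and the exact set of terms returned may differ between tm versions.

```r
library(tm)

sentence <- "Astrology: I am a Capricorn Sun Cap moon and cap rising...what does that say about me?"

# VCorpus instead of Corpus, so the control options are handed to termFreq.
vcorpus <- VCorpus(VectorSource(sentence))

tdm <- TermDocumentMatrix(
  vcorpus,
  control = list(
    tokenize = "words",          # the words tokenizer from the question
    wordLengths = c(1, Inf),     # keep one-letter terms such as "a" and "i"
    removePunctuation = FALSE    # leave punctuation attached to the terms
  )
)

# List the terms and check whether "rising...what" survived as a single term.
Terms(tdm)
"rising...what" %in% Terms(tdm)
```

If the last line returns TRUE, the punctuation is being left alone as requested.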