
TermDocumentMatrix doing unrequested cleaning (e.g. removing punctuation)

The TermDocumentMatrix function of the tm package is not functioning according to my understanding of the documentation. It seems to be doing processing on the terms that I have not requested.

Here is an example:

library(tm)

sentence <- "Astrology: I am a Capricorn Sun Cap moon and cap rising...what does that say about me?"
corpus <- Corpus(VectorSource(sentence))
tdm <- TermDocumentMatrix(corpus, control = list(wordLengths = c(1, Inf),
                                                 removePunctuation = FALSE))
Terms(tdm)

We can see from the output that the punctuation has been removed, and the expression "rising...what" has been split:

 [1] "a"         "about"     "am"        "and"       "astrology" "cap"       "capricorn" "does"      "i"         "me"        "moon"      "rising"    "say"       "sun"       "that"     
[16] "what"  

In the related SO question, the issue was with the tokenizer which was removing the punctuation. However, I am using the default words tokenizer, which I don't believe does this:

> sapply(corpus, words)
 [1,] "Astrology:"   
 [2,] "I"            
 [3,] "am"           
 [4,] "a"            
 [5,] "Capricorn"    
 [6,] "Sun"          
 [7,] "Cap"          
 [8,] "moon"         
 [9,] "and"          
[10,] "cap"          
[11,] "rising...what"
[12,] "does"         
[13,] "that"         
[14,] "say"          
[15,] "about"        
[16,] "me?" 

Is the observed behaviour incorrect, or what is my misunderstanding?


  • Corpus() gave you a SimpleCorpus object here. This class was introduced in tm version 0.7 and, according to ?SimpleCorpus,

    takes internally various shortcuts to boost performance and minimize memory pressure

    class(corpus)
    # [1] "SimpleCorpus" "Corpus"

    Now, as help(TermDocumentMatrix) states:

    Available local options are documented in termFreq and are internally delegated to a termFreq call. This is different for a SimpleCorpus. In this case all options are processed in a fixed order in one pass to improve performance. It always uses the Boost Tokenizer (via Rcpp)...

    So you are not using words as the tokenizer. If you were, it would indeed give you

     [1] "Astrology:"    "I"             "am"            "a"             "Capricorn"     "Sun"           "Cap"          
     [8] "moon"          "and"           "cap"           "rising...what" "does"          "that"          "say"          
    [15] "about"         "me?"  
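
To make the difference concrete, here is a rough base R sketch of the two behaviours. It is only an approximation: the regular expressions below are my own stand-ins, not the code tm or Boost actually runs. Splitting on whitespace keeps punctuation attached to the tokens, while treating punctuation as a delimiter produces the term list seen in the question.

```r
sentence <- "Astrology: I am a Capricorn Sun Cap moon and cap rising...what does that say about me?"

# Whitespace-only splitting: punctuation stays attached to each token.
whitespace_tokens <- strsplit(sentence, "[[:space:]]+")[[1]]

# Splitting on runs of non-alphanumeric characters: punctuation acts as a
# delimiter. This is an assumed stand-in for a punctuation-aware tokenizer,
# not the Boost Tokenizer itself.
punct_tokens <- strsplit(sentence, "[^[:alnum:]]+")[[1]]
punct_tokens <- punct_tokens[nzchar(punct_tokens)]

print(whitespace_tokens)
print(sort(unique(tolower(punct_tokens))))
```

The first vector still contains "rising...what" as a single token; in the second it has been split into "rising" and "what".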

    As stated in the comments, you could explicitly make your corpus a volatile corpus (see ?VCorpus) to regain full flexibility:

    A volatile corpus is fully kept in memory and thus all changes only affect the corresponding R object

    corpus <- VCorpus(VectorSource(sentence)) 
    Terms(TermDocumentMatrix(corpus, control = list(tokenize = "words")))
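
Putting it together with the options from the question, a minimal sketch might look like the following. It assumes tm 0.7 or later is installed and relies on the termFreq defaults for everything not listed (for example, terms are still lowercased unless tolower = FALSE is added).

```r
library(tm)

sentence <- "Astrology: I am a Capricorn Sun Cap moon and cap rising...what does that say about me?"

# VCorpus (rather than Corpus) so that control options are delegated to termFreq().
corpus <- VCorpus(VectorSource(sentence))

tdm <- TermDocumentMatrix(
  corpus,
  control = list(
    tokenize = "words",          # whitespace-based tokenizer
    removePunctuation = FALSE,   # keep punctuation attached to terms
    wordLengths = c(1, Inf)      # keep one-character terms such as "a" and "i"
  )
)

terms <- Terms(tdm)
print(terms)
```

With this setup the term list should keep "rising...what" as a single term instead of splitting it.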