r text-mining tm

Keeping punctuation in R Document Term Matrix


I'm trying to make a DocumentTermMatrix in R, using the control = list() parameter to limit the terms to a pre-defined list of text-based emojis (:D, :), :(, etc.). However, the dtm doesn't pick up certain emojis (like ":D" or ":)"), while others work fine (":))"). My code:

library(tm)
text = c(":D", ":))")
corpus <- Corpus(VectorSource(text))
corpus = tm_map(corpus, PlainTextDocument)
dtm = DocumentTermMatrix(corpus, list(dictionary = c(":D", ":))")))
emojidf <- as.data.frame(as.matrix(dtm))

  :D :))
1  0   0
2  0   1

To fix this, I could use content_transformer and gsub to change the problematic emojis to words. However, I'd like to know how DocumentTermMatrix, or even Corpus, can be made to treat punctuation as words.
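For reference, a minimal sketch of that workaround, applied with plain gsub to the character vector before the corpus is built. The replacement words emojigrin and emojihappy are hypothetical placeholders, not anything tm defines:

```r
# Replace each emoji with a plain-word token before building the corpus.
# "emojigrin" and "emojihappy" are made-up placeholder names.
text <- c(":D", ":))")
text <- gsub(":D", "emojigrin", text, fixed = TRUE)   # fixed = TRUE: match literally, not as a regex
text <- gsub(":))", "emojihappy", text, fixed = TRUE)
text
```

If the list also contained ":)", the longer ":))" would need to be replaced first so that it is not partially matched.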


Solution

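  • Two issues (see ?DocumentTermMatrix and ?termFreq): The wordLengths filter by default demands a minimum word length of 3 characters, so the two-character :D is dropped while the three-character :)) is kept. And tolower by default turns :D into :d, which no longer matches the dictionary entry. So try: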

    library(tm)
    text <- c(":D", ":))" ) 
    corpus <- Corpus(VectorSource(text))
    dtm <- DocumentTermMatrix(
      corpus,
      control = list(
        dictionary = c(":D", ":))"),
        wordLengths = c(-Inf, Inf),  # keep terms of any length (default minimum is 3)
        tolower = FALSE              # keep case so ":D" is not turned into ":d"
      )
    )
    as.matrix(dtm)
    #     Terms
    # Docs :)) :D
    #    1   0  1
    #    2   1  0
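The two defaults can be checked in isolation with base R alone. This is only an illustrative sketch of why ":D" was lost, not a reproduction of what tm does internally; the variable name emojis is made up for the example:

```r
emojis <- c(":D", ":))")

# Character counts: ":D" has 2 characters, below the default minimum of 3.
nchar(emojis)

# Lowercasing changes ":D" to ":d", which no longer equals the dictionary term.
tolower(emojis)
tolower(emojis) %in% emojis
```

Only ":))" passes both checks, which matches the single non-zero count in the original matrix.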