I'm trying to make a DocumentTermMatrix in R, using the parameter control = list() to limit the terms to a pre-defined list of text-based emojis (:D, :), :(, etc.). However, the DTM doesn't pick up certain emojis (like ":D" or ":)"), while others work fine (":))"). My code:
library(tm)
text = c(":D", ":))")
corpus <- Corpus(VectorSource(text))
corpus = tm_map(corpus, PlainTextDocument)
dtm = DocumentTermMatrix(corpus, list(dictionary = c(":D", ":))")))
emojidf <- as.data.frame(as.matrix(dtm))
  :D :))
1  0   0
2  0   1
To fix this, I could use content_transformer and gsub to change the problematic emojis to words. However, I'd like to know how DocumentTermMatrix, or even Corpus, treats punctuation as words.
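The gsub workaround mentioned above could look roughly like the following base-R sketch. The placeholder words, the emoji_words mapping, and the replace_emojis helper are all hypothetical names chosen for illustration; they are not part of tm.

```r
# Hypothetical mapping from emoticon to placeholder word (names are illustrative).
emoji_words <- c(
  ":))" = "emojibigsmile",
  ":)"  = "emojismile",
  ":D"  = "emojigrin",
  ":("  = "emojisad"
)

replace_emojis <- function(x, mapping = emoji_words) {
  # Replace longer emoticons first so ":))" is not consumed by ":)".
  mapping <- mapping[order(nchar(names(mapping)), decreasing = TRUE)]
  for (pattern in names(mapping)) {
    # fixed = TRUE treats the emoticon literally, not as a regular expression.
    x <- gsub(pattern, paste0(" ", mapping[[pattern]], " "), x, fixed = TRUE)
  }
  # Collapse the padding spaces added around each placeholder.
  trimws(gsub("\\s+", " ", x))
}

cleaned <- replace_emojis(c("great :D", "ok :))", "meh :("))
cleaned
```

With tm loaded, the same helper could presumably be applied to a corpus via tm_map(corpus, content_transformer(replace_emojis)). The placeholders are lowercase and longer than two characters, so they should survive the default term filters.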
Two issues (see ?DocumentTermMatrix and ?termFreq): the wordLengths filter by default demands a minimum word length of 3 characters, so two-character terms such as :D and :) are dropped. And tolower by default turns :D into :d, which no longer matches the dictionary entry. So try:
library(tm)
text <- c(":D", ":))" )
corpus <- Corpus(VectorSource(text))
dtm <- DocumentTermMatrix(
  corpus,
  control = list(
    dictionary = c(":D", ":))"),
    wordLengths = c(-Inf, Inf),
    tolower = FALSE
  )
)
as.matrix(dtm)
#     Terms
# Docs :)) :D
#    1   0  1
#    2   1  0
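
To see why those two defaults matter for these particular terms, here is a small base-R check that does not need tm. It only mirrors the two conditions described above (a minimum length of 3 and lowercasing) and is not how termFreq is implemented internally; the variable names are made up for this sketch.

```r
terms <- c(":D", ":)", ":))")

# Character counts: the default wordLengths = c(3, Inf) keeps only terms
# with at least 3 characters.
term_lengths <- nchar(terms)
passes_default_length <- term_lengths >= 3

# Lowercasing changes ":D", so it no longer equals the dictionary entry.
lowered <- tolower(terms)
still_matches <- lowered == terms

data.frame(term = terms, length = term_lengths,
           passes_default_length, still_matches)
```

Only ":))" passes both checks, which is consistent with it being the one emoji the original call picked up.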