Let's say there is a string "COLORED PENCIL STAEDTLER NORIS CLUB ASSORTED COLORS PKT12". My code is:
> a1 <- VCorpus(VectorSource("COLORED PENCIL STAEDTLER NORIS CLUB ASSORTED COLORS PKT12"))
> a3 <- TermDocumentMatrix(a1,control = list(stemming=T))
The matrix is:
           Docs
Terms       1
  assort    1
  club      1
  color     2
  nori      1
  pencil    1
  pkt12     1
  staedtler 1
So we can see that stemming works for "colored" and "colors": both are reduced to "color". However, if I do:
> a1 <- VCorpus(VectorSource("COLORED PENCIL STAEDTLER NORIS CLUB ASSORTED COLORS PKT12"))
> a2 <- a1 %>% tm_map(PlainTextDocument) %>% tm_map(stemDocument,"english")
> a2[[1]]$content
[1] "COLORED PENCIL STAEDTLER NORIS CLUB ASSORTED COLORS PKT12"
> a2 <- a2 %>% TermDocumentMatrix()
The matrix is:
           Docs
Terms       character(0)
  assorted             1
  club                 1
  colored              1
  colors               1
  noris                1
  pencil               1
  pkt12                1
  staedtler            1
We can see that stemDocument does not work here. I also notice that the document name is "character(0)", which does not appear in the first matrix, but I do not know why.
My situation is that I need to pre-process the text data (remove stop words, stem, and so on) and then save the processed text to a csv file, so I cannot rely on TermDocumentMatrix alone to do the stemming. Could anyone help me out here? Thanks a lot.
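This should help you achieve what you want. I usually convert all the text to lower case, remove punctuation marks, etc., before creating the dtm/tdm. The stemmer behind stemDocument expects lower-case input, so upper-case words pass through unchanged. Your first snippet worked because TermDocumentMatrix lower-cases terms by default before it stems them.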
library(tm)
library(magrittr) ## provides the %>% pipe used below
txt <- "COLORED PENCIL STAEDTLER NORIS CLUB ASSORTED COLORS PKT12"
txt <- tolower(txt) ## the extra step: convert everything to lower case before stemming
a1 <- VCorpus(VectorSource(txt))
a2 <- a1 %>% tm_map(stemDocument) ## stem every document in the corpus
a2 <- a2 %>% TermDocumentMatrix()
inspect(a2)
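character(0) appears because of the call to PlainTextDocument(), which rebuilds each document with empty metadata, so the document id ends up as character(0). If you only added it to get around an error such as Error: inherits(doc, "TextDocument") is not TRUE, which shows up after passing tolower directly to tm_map, wrap the function in content_transformer instead.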
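Since you also need the processed text in a csv file, here is a minimal sketch of the content_transformer route. It assumes a single-column output; the file name processed_text.csv and the names corp, processed and out_file are only placeholders for illustration, not anything from your code.

```r
library(tm)

txt <- "COLORED PENCIL STAEDTLER NORIS CLUB ASSORTED COLORS PKT12"
corp <- VCorpus(VectorSource(txt))

# Lower-case inside the corpus; content_transformer keeps each element a
# TextDocument, so PlainTextDocument() is not needed and the ids survive.
corp <- tm_map(corp, content_transformer(tolower))

# Stem after lower-casing, since upper-case words are left unchanged.
corp <- tm_map(corp, stemDocument)

# Pull the processed text back out as one string per document.
processed <- vapply(corp, function(d) paste(content(d), collapse = " "),
                    character(1))

# Hypothetical output path; adjust to wherever the file should go.
out_file <- "processed_text.csv"
write.csv(data.frame(text = processed), out_file, row.names = FALSE)
```

Other steps such as removeWords with stopwords("english") or removePunctuation can be added as further tm_map calls before the stemming step.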
Hope this helps.