stemDocument works in TermDocumentMatrix but does not work in tm_map using tm and R


Let's say there is a string "COLORED PENCIL STAEDTLER NORIS CLUB ASSORTED COLORS PKT12". My code is:

> a1 <- VCorpus(VectorSource("COLORED PENCIL STAEDTLER NORIS CLUB ASSORTED COLORS PKT12"))
> a3 <- TermDocumentMatrix(a1,control = list(stemming=T))

The matrix is:

           Docs
Terms       1
  assort    1
  club      1
  color     2
  nori      1
  pencil    1
  pkt12     1
  staedtler 1

So we can see that stemming works for "colored" and "colors": both became "color". However, if I do:

> a1 <- VCorpus(VectorSource("COLORED PENCIL STAEDTLER NORIS CLUB ASSORTED COLORS PKT12"))
> a2 <- a1 %>% tm_map(PlainTextDocument) %>% tm_map(stemDocument,"english")
> a2[[1]]$content
[1] "COLORED PENCIL STAEDTLER NORIS CLUB ASSORTED COLORS PKT12"
> a2 <- a2 %>% TermDocumentMatrix()

The matrix is:

           Docs
Terms       character(0)
  assorted             1
  club                 1
  colored              1
  colors               1
  noris                1
  pencil               1
  pkt12                1
  staedtler            1

We can see that stemDocument does not work here. I also notice a "character(0)" column header that does not appear in the first matrix, and I do not know why.

I need to pre-process the text (stop-word removal, stemming and so on) and then save the processed text to a CSV file, so I cannot simply use TermDocumentMatrix to generate the matrix. Could anyone help me out here? Thanks a lot.


Solution

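  • This should help you achieve what you want. I usually convert all the text to lower case, remove punctuation marks and so on before creating the dtm/tdm. The lower-casing matters here: stemDocument leaves upper-case words unchanged, whereas TermDocumentMatrix lower-cases terms by default before stemming, which is why your first example worked.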

    library(tm)
    library(magrittr)  # provides %>%, which tm does not export
    txt <- "COLORED PENCIL STAEDTLER NORIS CLUB ASSORTED COLORS PKT12"
    
    txt <- tolower(txt)  ## extra step: convert everything to lower case
    
    a1 <- VCorpus(VectorSource(txt))
    a2 <- a1 %>% tm_map(stemDocument)
    a2 <- a2 %>% TermDocumentMatrix()
    inspect(a2)
    

    character(0) appears because of the call to PlainTextDocument(): it builds a new document whose ID is empty, and that empty ID becomes the column header. If you only added it because passing tolower to tm_map gave the error "Error: inherits(doc, "TextDocument") is not TRUE", wrap the function in content_transformer instead.
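
    Since the goal is to save the processed text rather than a matrix, here is a minimal sketch of how the whole pipeline could look when everything is done inside tm_map. It assumes the SnowballC package is installed (stemDocument relies on it); the file name processed_text.csv and the column names doc_id and text are placeholders of mine, not anything from the question.

    ```r
    library(tm)  # stemDocument additionally needs SnowballC to be installed

    txt <- "COLORED PENCIL STAEDTLER NORIS CLUB ASSORTED COLORS PKT12"
    corpus <- VCorpus(VectorSource(txt))

    # content_transformer() wraps a plain character function so that tm_map
    # still returns text documents; no PlainTextDocument() call is needed
    corpus <- tm_map(corpus, content_transformer(tolower))
    corpus <- tm_map(corpus, removeWords, stopwords("english"))
    corpus <- tm_map(corpus, stemDocument)

    # Pull the processed text back out, one string per document
    processed <- unname(vapply(
      corpus,
      function(doc) paste(content(doc), collapse = " "),
      character(1)
    ))

    # Placeholder file and column names; adjust to your own data
    out <- data.frame(doc_id = seq_along(processed),
                      text = processed,
                      stringsAsFactors = FALSE)
    write.csv(out, "processed_text.csv", row.names = FALSE)
    ```

    Lower-casing comes first on purpose, so that both removeWords and stemDocument see lower-case input.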

    Hope this helps.