r, tm, term-document-matrix

R: DocumentTermMatrix Wrong Frequencies after mgsub


I have a DocumentTermMatrix and I'd like to replace specific terms in it and then create a frequency table.

The starting point is the original document as follows:

library(tm)
library(qdap)

    df1 <- data.frame(word =c("test", "test", "teste", "hey", "heyyy", "hi"))
    tdm <- as.DocumentTermMatrix(as.character(df1$word))

When I create a frequency table of the original document I get the correct results:

freq0 <- as.matrix(sort(colSums(as.matrix(tdm)), decreasing=TRUE))
freq0 

So far so good. However, if I replace some terms in the document, the new frequency table gives wrong results:

    tdm$dimnames$Terms <- mgsub(c("teste", "heyyy"), c("test", "hey"), as.character(tdm$dimnames$Terms), fixed=T, trim=T)
    freq1 <- as.matrix(sort(colSums(as.matrix(tdm)), decreasing=TRUE))
    freq1

It looks as if some indexing in the matrix is not updated, because terms that are now identical are still counted separately.

This is the result I would like to get:

df2 <- data.frame(word =c("test", "test", "test", "hey", "hey", "hi"))
tdm2 <- as.DocumentTermMatrix(as.character(df2$word))
tdm2$dimnames$Terms <- mgsub(c("teste", "heyyy"), c("test", "hey"), as.character(tdm2$dimnames$Terms), fixed=T, trim=T)
freq2 <- as.matrix(sort(colSums(as.matrix(tdm2)), decreasing=TRUE))
freq2

Can anyone help me to figure out the problem?

Thanks in advance.


Solution

  • We can look at the structure of as.matrix(tdm):

    str(as.matrix(tdm))
    #num [1, 1:5] 1 1 1 2 1
    # - attr(*, "dimnames")=List of 2
    #  ..$ Docs : chr "all"
    # ..$ Terms: chr [1:5] "hey" "heyyy" "hi" "test" ...
    

    This is a matrix with one row and five columns, so colSums simply returns that row unchanged. It does not merge columns that share a name. Grouping the counts by term name does:

    xtabs(as.vector(tdm)~tdm$dimnames$Terms)
    #tdm$dimnames$Terms
    #  hey heyyy    hi  test teste 
    #   1     1     1     2     1 
    

    After replacing the terms with mgsub, the same call gives:

    xtabs(as.vector(tdm)~tdm$dimnames$Terms)
    #tdm$dimnames$Terms
    # hey   hi test 
    #  2    1    3 
    

    xtabs sums the counts within each term name. The same can be done with tapply:

     tapply(as.vector(tdm), tdm$dimnames$Terms, FUN = sum)
    

    If the matrix has more than one row, we can apply colSums first and then group by term:

     tapply(colSums(as.matrix(tdm)),  tdm$dimnames$Terms, FUN = sum)
     # hey   hi test 
     #  4    2    6 
    

    NOTE: The output above was produced after the terms were replaced with mgsub.
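
The same behaviour can be checked without tm or qdap. Below is a minimal base R sketch, assuming a plain numeric matrix stands in for as.matrix(tdm). The objects m, m2, per_column, merged and merged_matrix are hypothetical names introduced for illustration, and the direct colnames assignment stands in for the mgsub call. It shows that renaming columns leaves duplicates in place, that tapply merges them, and that rowsum is one way to merge them while keeping one row per document.

```r
# Hypothetical stand-in for as.matrix(tdm): one document (row), five terms.
m <- matrix(c(1, 1, 1, 2, 1), nrow = 1,
            dimnames = list(Docs = "all",
                            Terms = c("hey", "heyyy", "hi", "test", "teste")))

# Rename terms in place (assumption: this mimics the effect of the mgsub call).
colnames(m)[colnames(m) == "heyyy"] <- "hey"
colnames(m)[colnames(m) == "teste"] <- "test"

# colSums still returns one entry per column, so duplicated names stay separate.
per_column <- colSums(m)

# Grouping the per-column totals by term name merges the duplicates.
merged <- tapply(per_column, colnames(m), FUN = sum)

# With several documents, rowsum() on the transposed matrix merges columns
# that share a name while keeping one row per document.
m2 <- rbind(m, m)
rownames(m2) <- c("doc1", "doc2")
merged_matrix <- t(rowsum(t(m2), group = colnames(m2)))
```

Here per_column still has five entries, while merged has three (hey 2, hi 1, test 3). For the two-row m2, colSums(merged_matrix) gives hey 4, hi 2, test 6. rowsum is only one option; it is useful when the per-document counts are still needed after merging.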