How do I set up the TF weight of terms in a corpus using the ‘tm’ package in R?


How can I get the term frequency (TF) weight in the tm package, where tf = (occurrences of the term in the document) / (total terms in the document)?

MyMatrix <- DocumentTermMatrix(a, control = list(weight= weightTf))

With this weighting, the matrix shows the raw count of each term rather than the TF weight, like this:

Doc(1)  1   0   0   3   0   0   2
Doc(2)  0   0   0   0   0   0   0
Doc(3)  0   5   0   0   0   0   1
Doc(4)  0   0   0   2   2   0   0
Doc(5)  0   4   0   0   0   0   1
Doc(6)  5   0   0   0   1   0   0
Doc(7)  0   5   0   0   0   0   0
Doc(8)  0   0   0   1   0   0   7

Solution

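  • weightTf only stores the raw counts (and the control option is named weighting, not weight), so a custom weighting function is needed to divide each count by its document's total. For example: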

    library(tm)
    corp <- Corpus(VectorSource(c(doc1="hello world", doc2="hello new world")))
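    # tm applies the weighting to the term-by-document matrix,
    # so the column sums are the per-document totals
    myfun <- WeightFunction(function(m) {
      cs <- slam::col_sums(m)   # total number of terms in each document
      m$v <- m$v / cs[m$j]      # divide each count by its document's total
      return(m)
    }, "Term Frequency by Total Document Term Frequency", "termbytot")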
    dtm <- DocumentTermMatrix(corp, control = list(weighting = myfun))
    inspect(dtm)
    # <<DocumentTermMatrix (documents: 2, terms: 3)>>
    # Non-/sparse entries: 5/1
    # Sparsity           : 17%
    # Maximal term length: 5
    # 
    #     Terms
    # Docs     hello       new     world
    #    1 0.5000000 0.0000000 0.5000000
    #    2 0.3333333 0.3333333 0.3333333
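
If the counts are already in an ordinary matrix (for instance after calling as.matrix() on a document-term matrix), the same normalization can be sketched in base R. The numbers below are copied from the question. The term names t1..t7 and the object names counts, totals and tf are made up for illustration, and leaving an all-zero document at 0 is an assumption, not something tm prescribes.

```r
# Counts copied from the question: one row per document, one column per term.
# The term names t1..t7 are placeholders; the question does not show them.
counts <- matrix(c(
  1, 0, 0, 3, 0, 0, 2,
  0, 0, 0, 0, 0, 0, 0,
  0, 5, 0, 0, 0, 0, 1,
  0, 0, 0, 2, 2, 0, 0,
  0, 4, 0, 0, 0, 0, 1,
  5, 0, 0, 0, 1, 0, 0,
  0, 5, 0, 0, 0, 0, 0,
  0, 0, 0, 1, 0, 0, 7
), nrow = 8, byrow = TRUE,
dimnames = list(paste0("Doc", 1:8), paste0("t", 1:7)))

totals <- rowSums(counts)   # total number of terms in each document
tf <- counts / totals       # recycling divides row i by totals[i]
tf[totals == 0, ] <- 0      # assumption: empty documents stay at 0 instead of NaN
round(tf, 3)
```

Each non-empty row of tf then sums to 1, which is a quick way to check that the weights are relative frequencies and not raw counts.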