
DTM sparsity differs between tf and tf-idf weighting on the same corpus


Can anybody explain?

My understanding:

tf >= 0 (absolute frequency value)

tfidf >= 0 (for negative idf, tf=0)



sparse entry = 0

nonsparse entry > 0

So the exact sparse/nonsparse proportion should be the same in the two DTMs created with the code below.

library(tm)
data(crude)

dtm <- DocumentTermMatrix(crude, control=list(weighting=weightTf))
dtm2 <- DocumentTermMatrix(crude, control=list(weighting=weightTfIdf))
dtm
dtm2

But:

> dtm
<<DocumentTermMatrix (documents: 20, terms: 1266)>>
Non-/sparse entries: 2255/23065
Sparsity           : 91%
Maximal term length: 17
Weighting          : term frequency (tf)
> dtm2
<<DocumentTermMatrix (documents: 20, terms: 1266)>>
Non-/sparse entries: 2215/23105
Sparsity           : 91%
Maximal term length: 17
Weighting          : term frequency - inverse document frequency (normalized) (tf-idf)

Solution

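  • The sparsity can differ. A TF-IDF value is zero if TF is zero or if IDF is zero, and IDF is zero when a term occurs in every document. Such a term has nonzero TF entries in every document but only zero TF-IDF entries, so the TF-IDF matrix has fewer nonsparse entries. Consider the following example: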

    library(tm)

    txts <- c("super World", "Hello World", "Hello super top world")
    corp <- Corpus(VectorSource(txts))
    # Same corpus, two weightings: raw term frequency vs. normalized tf-idf
    tf <- TermDocumentMatrix(corp, control = list(weighting = weightTf))
    tfidf <- TermDocumentMatrix(corp, control = list(weighting = weightTfIdf))
    
    inspect(tf)
    # <<TermDocumentMatrix (terms: 4, documents: 3)>>
    # Non-/sparse entries: 8/4
    # Sparsity           : 33%
    # Maximal term length: 5
    # Weighting          : term frequency (tf)
    # 
    #        Docs
    # Terms   1 2 3
    #   hello 0 1 1
    #   super 1 0 1
    #   top   0 0 1
    #   world 1 1 1
    
    inspect(tfidf)
    # <<TermDocumentMatrix (terms: 4, documents: 3)>>
    # Non-/sparse entries: 5/7
    # Sparsity           : 58%
    # Maximal term length: 5
    # Weighting          : term frequency - inverse document frequency (normalized) (tf-idf)
    # 
    #        Docs
    # Terms           1         2         3
    #   hello 0.0000000 0.2924813 0.1462406
    #   super 0.2924813 0.0000000 0.1462406
    #   top   0.0000000 0.0000000 0.3962406
    #   world 0.0000000 0.0000000 0.0000000
    

    The term super occurs once in document 1, which has 2 terms (normalized tf = 1/2), and it occurs in 2 out of 3 documents (idf = log2(3/2)):

    1/2 * log2(3/2)
    # [1] 0.2924813
    

    The term world occurs once in document 3, which has 4 terms, but it occurs in all 3 documents, so its idf is log2(3/3) = 0 and its tf-idf is zero in every document:

    1/4 * log2(3/3) # 1/4 * 0
    # [1] 0
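
The hand calculations above can also be checked without tm. What follows is a minimal base R sketch, assuming normalized term frequency (count divided by document length) and idf = log2(N / df), which is what the worked numbers imply. The names `counts`, `tf_norm`, `idf`, and `tfidf_manual` are illustrative only and are not part of tm.

```r
# Raw term counts from the toy example (rows: terms, columns: documents)
counts <- matrix(c(0, 1, 1,
                   1, 0, 1,
                   0, 0, 1,
                   1, 1, 1),
                 nrow = 4, byrow = TRUE,
                 dimnames = list(Terms = c("hello", "super", "top", "world"),
                                 Docs  = c("1", "2", "3")))

# Normalized term frequency: divide each column by its document length
tf_norm <- sweep(counts, 2, colSums(counts), "/")

# Inverse document frequency: log2(number of documents / documents containing the term)
idf <- log2(ncol(counts) / rowSums(counts > 0))

# Each row (term) is scaled by its idf; recycling runs down the columns
tfidf_manual <- tf_norm * idf

# Terms whose idf is zero, i.e. terms present in every document
zero_idf_terms <- names(idf)[idf == 0]

nonzero_tf    <- sum(counts != 0)        # nonsparse entries under tf
nonzero_tfidf <- sum(tfidf_manual != 0)  # nonsparse entries under tf-idf

print(round(tfidf_manual, 7))
print(zero_idf_terms)
print(c(tf = nonzero_tf, tfidf = nonzero_tfidf))
```

The difference between the two counts equals the number of zero-idf terms times the number of documents (here 1 term × 3 documents = 3). To run the same check on the corpus from the question, `t(as.matrix(dtm))` should give a terms-by-documents count matrix that can stand in for `counts`. The gap of 40 nonsparse entries reported there (2255 vs. 2215) is consistent with two terms that occur in all 20 documents, though the sketch does not confirm which terms those are.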