r, nlp, tm

What does the "support" feature mean in the result of the function "term_stats()" from package "tm" in R, and how is it different from "count"?


Running the following script produces the results shown below:

a <- c("Your work is going to fill a large part of your life, and the only way to be truly satisfied is to do what you believe is great work. And the only way to do great work is to love what you do. If you haven't found it yet, keep looking. Don't settle. As with all matters of the heart, you'll know when you find it. - Steve Jobs")
a_source <- VectorSource(a)
a_corpus <- VCorpus(a_source)
term_stats(a_corpus)

       term    count   support
    1  .         5       1
    2  to        5       1
    3  is        4       1
    4  you       4       1
    5  ,         3       1

Solution

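  • support is the number of documents in which the term occurs; count is the total number of occurrences across all documents. With a single document, as in the question, support is 1 for every term. You need both when computing tf-idf.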

    library(tm)
    library(corpus)  # term_stats() is provided by the corpus package, not tm
    
    txt <- c("Your work is going to fill a large part of your life, 
           and the only way to be truly satisfied is to do what you
            believe is great work. 
           And the only way to do great work is to love what you do. 
           If you haven't found it yet, keep looking. Don't settle. 
           As with all matters of the heart, you'll know when you find it. 
           - Steve Jobs")
    
    term_stats(VCorpus(VectorSource(txt)))[1:3, ]
    
    term count support
    .        5       1
    to       5       1
    is       4       1
    
    
    # Split txt into 4 documents
    txt_df <- data.frame( txt = c(
    "Your work is going to fill a large part of your life, 
     and the only way to be truly satisfied is to do what you 
     believe is great work." , 
     "And the only way to do great work is to love what you do." , 
     "If you haven't found it yet, keep looking. Don't settle." , 
     "As with all matters of the heart, you'll know when you find it. - 
     Steve Jobs"))
    
    term_stats(VCorpus(VectorSource(txt_df$txt)))[1:6,]
    
    term count support
    .        5       4
    you      4       4
    ,        3       3
    the      3       3
    to       5       2
    is       4       2
    

    By default, rows are sorted by support in descending order, with count breaking ties.
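
The difference between count and support can also be reproduced by hand in base R. This is a minimal sketch, not how term_stats() is implemented: the names docs, tokens and stats are made up for illustration, the toy documents are hypothetical, and the whitespace tokenizer is a simplifying assumption (the real tokenizer also splits off punctuation). The idf column uses one common variant, log(N / support), to show where support enters tf-idf.

```r
# Four toy documents (hypothetical data, not the text from the question).
docs <- c("to be or not to be",
          "to do is to be",
          "do be do",
          "not now")

# Simplified tokenizer (an assumption): lower-case, then split on whitespace.
tokens <- strsplit(tolower(docs), "\\s+")

# count: total occurrences of each term across all documents.
count <- table(unlist(tokens))

# support: number of documents that contain the term at least once,
# obtained by de-duplicating the tokens within each document first.
support <- table(unlist(lapply(tokens, unique)))

stats <- data.frame(term    = names(count),
                    count   = as.integer(count),
                    support = as.integer(support[names(count)]),
                    stringsAsFactors = FALSE)

# Sort by support, then count (both descending), then term.
stats <- stats[order(-stats$support, -stats$count, stats$term), ]
rownames(stats) <- NULL

# One common idf variant: support plays the role of document frequency.
stats$idf <- log(length(docs) / stats$support)

print(stats)
```

Here "to" and "be" both have a count of 4, but "be" appears in three documents and "to" in only two, so "be" sorts first and gets the lower idf.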