
Calculate `tf-idf` for a data frame of documents


The following code

library(dplyr)
library(janeaustenr)
library(tidytext)
book_words <- austen_books() %>%
  unnest_tokens(word, text) %>%
  count(book, word, sort = TRUE)

book_words <- book_words %>%
  bind_tf_idf(word, book, n)
book_words

taken from "Term Frequency and Inverse Document Frequency (tf-idf) Using Tidy Data Principles", estimates tf-idf for Jane Austen's works. However, this code appears to be specific to Jane Austen's books. Instead, I would like to compute tf-idf for the following data frame:

sentences<-c("The color blue neutralizes orange yellow reflections.", 
             "Zod stabbed me with blue Kryptonite.", 
             "Because blue is your favourite colour.",
             "Red is wrong, blue is right.",
             "You and I are going to yellowstone.",
             "Van Gogh looked for some yellow at sunset.",
             "You ruined my beautiful green dress.",
             "You do not agree.",
             "There's nothing wrong with green.")

df <- data.frame(text = sentences,
                 class = c("A","B","A","C","A","B","A","C","D"),
                 weight = c(1,1,3,4,1,2,3,4,5))

Solution

  • There are two things you need to change:

    1. Since you did not set stringsAsFactors = FALSE when constructing the data.frame, text is stored as a factor (the default in R versions before 4.0.0), so you need to convert it to character first.

    2. You do not have a column named book, which means you have to pick some other column as the document. Since you put a column named class into your example, I assume you want to calculate the tf-idf over this column.

    Here is the code:

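    library(dplyr)
    library(tidytext)

    # Tokenize the text and count each word within each class
    book_words <- df %>%
      mutate(text = as.character(text)) %>%  # no effect if text is already character
      unnest_tokens(output = word, input = text) %>%
      count(class, word, sort = TRUE)

    # Treat each class as one document when computing tf-idf
    book_words <- book_words %>%
      bind_tf_idf(term = word, document = class, n = n)
    book_words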
    #> # A tibble: 52 x 6
    #>    class word          n     tf   idf tf_idf
    #>    <fct> <chr>     <int>  <dbl> <dbl>  <dbl>
    #>  1 A     blue          2 0.0769 0.288 0.0221
    #>  2 A     you           2 0.0769 0.693 0.0533
    #>  3 C     is            2 0.2    0.693 0.139 
    #>  4 A     and           1 0.0385 1.39  0.0533
    #>  5 A     are           1 0.0385 1.39  0.0533
    #>  6 A     beautiful     1 0.0385 1.39  0.0533
    #>  7 A     because       1 0.0385 1.39  0.0533
    #>  8 A     color         1 0.0385 1.39  0.0533
    #>  9 A     colour        1 0.0385 1.39  0.0533
    #> 10 A     dress         1 0.0385 1.39  0.0533
    #> # ... with 42 more rows
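
To see where these numbers come from, here is a rough sketch that recomputes the three columns by hand and compares them with bind_tf_idf(). It assumes tf is a word's count divided by the total word count of its class, and idf is the natural log of (number of classes / number of classes containing the word), which matches the output above. The names counts, n_docs and manual are just for illustration.

```r
library(dplyr)
library(tidytext)

sentences <- c("The color blue neutralizes orange yellow reflections.",
               "Zod stabbed me with blue Kryptonite.",
               "Because blue is your favourite colour.",
               "Red is wrong, blue is right.",
               "You and I are going to yellowstone.",
               "Van Gogh looked for some yellow at sunset.",
               "You ruined my beautiful green dress.",
               "You do not agree.",
               "There's nothing wrong with green.")

df <- data.frame(text = sentences,
                 class = c("A","B","A","C","A","B","A","C","D"),
                 stringsAsFactors = FALSE)

# One row per (class, word) pair with its count n
counts <- df %>%
  unnest_tokens(word, text) %>%
  count(class, word)

# Number of documents, here the number of distinct classes
n_docs <- n_distinct(counts$class)

manual <- counts %>%
  group_by(class) %>%
  mutate(tf = n / sum(n)) %>%                         # share of the class's words
  group_by(word) %>%
  mutate(idf = log(n_docs / n_distinct(class))) %>%   # natural log
  ungroup() %>%
  mutate(tf_idf = tf * idf)

# "blue" in class A: tf = 2/26, idf = log(4/3)
manual %>% filter(class == "A", word == "blue")

# Should agree with tidytext's own result
all.equal(manual$tf_idf, bind_tf_idf(counts, word, class, n)$tf_idf)
```

For example, class A contains 26 words in total and "blue" occurs twice, so tf = 2/26. "blue" shows up in three of the four classes, so idf = log(4/3).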
    

    The documentation has helpful remarks on this; check out ?count and ?bind_tf_idf.
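
If you would rather treat every sentence as its own document instead of grouping by class, the same pattern should work with a row identifier as the document column. This is only a sketch under that assumption; sentence_id is a made-up column name that does not appear in your data.

```r
library(dplyr)
library(tidytext)

sentences <- c("The color blue neutralizes orange yellow reflections.",
               "Zod stabbed me with blue Kryptonite.",
               "Because blue is your favourite colour.",
               "Red is wrong, blue is right.",
               "You and I are going to yellowstone.",
               "Van Gogh looked for some yellow at sunset.",
               "You ruined my beautiful green dress.",
               "You do not agree.",
               "There's nothing wrong with green.")

df <- data.frame(text = sentences,
                 class = c("A","B","A","C","A","B","A","C","D"),
                 stringsAsFactors = FALSE)

sentence_words <- df %>%
  mutate(sentence_id = row_number()) %>%       # hypothetical per-sentence document id
  unnest_tokens(output = word, input = text) %>%
  count(sentence_id, word, sort = TRUE) %>%
  bind_tf_idf(term = word, document = sentence_id, n = n)

sentence_words
```

With nine short documents, words that appear in several sentences (such as "blue") get a lower idf than words that appear in only one.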