The following code
library(dplyr)
library(janeaustenr)
library(tidytext)
book_words <- austen_books() %>%
  unnest_tokens(word, text) %>%
  count(book, word, sort = TRUE)
book_words <- book_words %>%
  bind_tf_idf(word, book, n)
book_words
taken from "Term Frequency and Inverse Document Frequency (tf-idf) Using Tidy Data Principles", computes the tf-idf
of the words in Jane Austen's works. However, this code appears to be specific to Jane Austen's books. I would like to derive, instead, the tf-idf
for the following data frame:
sentences <- c("The color blue neutralizes orange yellow reflections.",
               "Zod stabbed me with blue Kryptonite.",
               "Because blue is your favourite colour.",
               "Red is wrong, blue is right.",
               "You and I are going to yellowstone.",
               "Van Gogh looked for some yellow at sunset.",
               "You ruined my beautiful green dress.",
               "You do not agree.",
               "There's nothing wrong with green.")
df <- data.frame(text = sentences,
                 class = c("A","B","A","C","A","B","A","C","D"),
                 weight = c(1,1,3,4,1,2,3,4,5))
There are two things you need to change:
First, since you did not set stringsAsFactors = FALSE when constructing the data.frame, you need to convert text to character first. (In R versions before 4.0.0, data.frame() turns strings into factors by default.)
Second, you do not have a column named book, which means you have to select some other column as the document. Since you put a column named class into your example, I assume you want to calculate the tf-idf over this column.
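The first point can be seen on a toy data frame (made up for illustration, not the df from the question). stringsAsFactors = TRUE is set explicitly here as an assumption, so that the sketch behaves the same on every R version:

```r
# Toy data, hypothetical: two short strings stored as a factor column
toy <- data.frame(text = c("blue is right", "red is wrong"),
                  stringsAsFactors = TRUE)
class(toy$text)   # "factor"

# Convert the factor back to plain character strings
toy$text <- as.character(toy$text)
class(toy$text)   # "character"
```

The same as.character() conversion is what the mutate() step does inside a pipeline.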
Here is the code:
library(dplyr)
library(tidytext)
book_words <- df %>%
  # convert the factor column to character before tokenizing
  mutate(text = as.character(text)) %>%
  unnest_tokens(output = word, input = text) %>%
  count(class, word, sort = TRUE)
# class takes the place of book as the document column
book_words <- book_words %>%
  bind_tf_idf(term = word, document = class, n)
book_words
#> # A tibble: 52 x 6
#>    class word          n     tf   idf tf_idf
#>    <fct> <chr>     <int>  <dbl> <dbl>  <dbl>
#>  1 A     blue          2 0.0769 0.288 0.0221
#>  2 A     you           2 0.0769 0.693 0.0533
#>  3 C     is            2 0.2    0.693 0.139
#>  4 A     and           1 0.0385 1.39  0.0533
#>  5 A     are           1 0.0385 1.39  0.0533
#>  6 A     beautiful     1 0.0385 1.39  0.0533
#>  7 A     because       1 0.0385 1.39  0.0533
#>  8 A     color         1 0.0385 1.39  0.0533
#>  9 A     colour        1 0.0385 1.39  0.0533
#> 10 A     dress         1 0.0385 1.39  0.0533
#> # ... with 42 more rows
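To see where these numbers come from, the values can be reproduced in base R. This is only a sketch under stated assumptions: the regex tokenizer is a stand-in chosen for illustration (it is not what unnest_tokens() uses internally), and all variable names are made up. It uses tf = n / (total words in the class) and idf = ln(number of classes / number of classes containing the word), which agrees with the rows printed above.

```r
sentences <- c("The color blue neutralizes orange yellow reflections.",
               "Zod stabbed me with blue Kryptonite.",
               "Because blue is your favourite colour.",
               "Red is wrong, blue is right.",
               "You and I are going to yellowstone.",
               "Van Gogh looked for some yellow at sunset.",
               "You ruined my beautiful green dress.",
               "You do not agree.",
               "There's nothing wrong with green.")
classes <- c("A", "B", "A", "C", "A", "B", "A", "C", "D")

# Hypothetical stand-in tokenizer: lowercase, then split on anything that is
# not a letter or an apostrophe
tokens <- strsplit(tolower(sentences), "[^a-z']+")
tokens_df <- data.frame(class = rep(classes, lengths(tokens)),
                        word = unlist(tokens),
                        stringsAsFactors = FALSE)
tokens_df <- tokens_df[tokens_df$word != "", ]

# n: how often each word occurs in each class
counts <- as.data.frame(table(class = tokens_df$class, word = tokens_df$word),
                        responseName = "n", stringsAsFactors = FALSE)
counts <- counts[counts$n > 0, ]

words_per_class  <- tapply(counts$n, counts$class, sum)  # total words per class
classes_per_word <- table(counts$word)                   # classes containing each word
n_classes        <- length(unique(counts$class))

counts$tf     <- counts$n / as.numeric(words_per_class[counts$class])
counts$idf    <- log(n_classes / as.numeric(classes_per_word[counts$word]))
counts$tf_idf <- counts$tf * counts$idf

counts[counts$class == "A" & counts$word == "blue", ]
```

For "blue" in class A this gives tf = 2/26 and idf = ln(4/3), so tf_idf is about 0.0221, matching row 1 of the output above.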
The documentation has helpful remarks on both functions; check out ?count and ?bind_tf_idf.