bind_tf_idf() error: in tapply(n, documents, sum) : arguments must have same length


I am trying to run bind_tf_idf() on the following data frame, which has two documents (classes): Y or N.

> test_2
# A tibble: 3,295 x 2
   Class word    
   <fct> <chr>   
 1 Y     nature
 2 Y     great
 3 Y     are     
 4 Y     present 
 5 N     in      
 6 N     weather   
 7 Y     moisture   
 8 N     humidity     
 9 Y     and     
10 Y     pollen
# … with 3,285 more rows
Warning message:
`...` is not empty.

We detected these problematic arguments:
* `needs_dots`

These dots only exist to allow future extensions and should be empty.
Did you misspecify an argument?

This is what I am using:

test_2_tf_idf <- test_2 %>%
  bind_tf_idf(word, Class, sum)

But I get the error message:

> test_2_tf_idf <- test_2 %>%
+   bind_tf_idf(word, Class, sum)
Error in tapply(n, documents, sum) : arguments must have same length

What I ultimately want is a table of calculations analogous to this (disregard the "total" column):

#> # A tibble: 40,379 x 7
#>    book              word      n  total     tf   idf tf_idf
#>    <fct>             <chr> <int>  <int>  <dbl> <dbl>  <dbl>
#>  1 Mansfield Park    the    6206 160460 0.0387     0      0
#>  2 Mansfield Park    to     5475 160460 0.0341     0      0
#>  3 Mansfield Park    and    5438 160460 0.0339     0      0
#>  4 Emma              to     5239 160996 0.0325     0      0
#>  5 Emma              the    5201 160996 0.0323     0      0
#>  6 Emma              and    4896 160996 0.0304     0      0
#>  7 Mansfield Park    of     4778 160460 0.0298     0      0
#>  8 Pride & Prejudice the    4331 122204 0.0354     0      0
#>  9 Emma              of     4291 160996 0.0267     0      0
#> 10 Pride & Prejudice to     4162 122204 0.0341     0      0
#> # … with 40,369 more rows

Except that in my case the "book" column corresponds to the "Y" or "N" class of each word.

What can I do to fix this tapply error?


Solution

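  • The fourth argument of tidytext::bind_tf_idf (n) is not a function. According to ?tidytext::bind_tf_idf, it is a

    Column containing document-term counts as string or symbol

    test_2 has no column named sum, so no counts reach tapply() and it fails with "arguments must have same length". Hence you first have to aggregate your data by Class and word, e.g. with dplyr::count, and then pass the resulting count column n: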

    test_2 <- structure(list(Class = c(
      "Y", "Y", "Y", "Y", "N", "N", "Y", "N",
      "Y", "Y"
    ), word = c(
      "vesicles", "exosomes", "are", "present",
      "in", "blood", "urine", "and", "and", "proteins"
    )), class = "data.frame", row.names = c(
      "1",
      "2", "3", "4", "5", "6", "7", "8", "9", "10"
    ))
    
    library(tidytext)
    library(dplyr)
    
    # Count each word per Class, then pass the count column n to bind_tf_idf()
    test_2_tf_idf <- test_2 %>%
      count(word, Class) %>%
      bind_tf_idf(word, Class, n)
    
    test_2_tf_idf
    #>        word Class n        tf       idf     tf_idf
    #> 1       and     N 1 0.3333333 0.0000000 0.00000000
    #> 2       and     Y 1 0.1428571 0.0000000 0.00000000
    #> 3       are     Y 1 0.1428571 0.6931472 0.09902103
    #> 4     blood     N 1 0.3333333 0.6931472 0.23104906
    #> 5  exosomes     Y 1 0.1428571 0.6931472 0.09902103
    #> 6        in     N 1 0.3333333 0.6931472 0.23104906
    #> 7   present     Y 1 0.1428571 0.6931472 0.09902103
    #> 8  proteins     Y 1 0.1428571 0.6931472 0.09902103
    #> 9     urine     Y 1 0.1428571 0.6931472 0.09902103
    #> 10 vesicles     Y 1 0.1428571 0.6931472 0.09902103
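
To see where the numbers in this output come from, the same columns can be rebuilt by hand with dplyr alone. This is only a sketch under assumptions: it takes tf to be a word's share of all words in its Class and idf to be the natural log of (number of classes / number of classes containing the word), which is consistent with the values printed above. The names n_docs and manual_tf_idf are made up for this illustration.

```r
library(dplyr)

# Same toy data as in the answer above, written out as a plain data frame
test_2 <- data.frame(
  Class = c("Y", "Y", "Y", "Y", "N", "N", "Y", "N", "Y", "Y"),
  word  = c("vesicles", "exosomes", "are", "present", "in",
            "blood", "urine", "and", "and", "proteins")
)

# Number of documents (classes); hypothetical helper name
n_docs <- n_distinct(test_2$Class)

manual_tf_idf <- test_2 %>%
  count(word, Class) %>%                              # one row per word/Class pair
  group_by(Class) %>%
  mutate(tf = n / sum(n)) %>%                         # share of the class's words
  group_by(word) %>%
  mutate(idf = log(n_docs / n_distinct(Class))) %>%   # natural log; 0 if the word is in every class
  ungroup() %>%
  mutate(tf_idf = tf * idf)

manual_tf_idf
```

With only two classes, idf can take just two values: 0 for words that occur in both classes (here "and") and log(2) for words that occur in one. Any word shared by Y and N therefore gets a tf_idf of 0, as in the first two rows of the output above.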