r text-mining

How to count occurrences of a word/token in a one-token-per-document-per-row tibble


Hello, I have a tibble produced by piping the output of tidytext::unnest_tokens() into count(category, word, name = "count"). It looks like this example:

owl <- tibble(category = c(0, 1, 2, -1, 0, 1, 2),
              word = c(rep("hello", 3), rep("world", 4)),
              count = sample(1:100, 7))

I would like to add a column to this tibble that gives the number of categories each word appears in, i.e. the same number repeated on every row where that word appears.

I tried the following code, which works in principle. The result is what I want.

owl %>% mutate(sum_t = sapply(1:nrow(.), function(x) {filter(., word == .$word[[x]]) %>% nrow()}))

However, since my data has tens of thousands of rows, this takes a rather long time. Is there a more efficient way to achieve this?


Solution

  • We could use add_count(), which appends a column n holding the number of rows for each word without collapsing the tibble:

    library(dplyr)
    
    owl %>%
      add_count(word)
    

    output:

      category word  count     n
         <dbl> <chr> <int> <int>
    1        0 hello    98     3
    2        1 hello    30     3
    3        2 hello    37     3
    4       -1 world    22     4
    5        0 world    80     4
    6        1 world    18     4
    7        2 world    19     4
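
If the new column should be called sum_t as in the question, or if the input might contain repeated (category, word) rows, the following sketch may help. It assumes the same toy data as above, with the count values copied from the output shown rather than sampled. The names res_rows and res_cats are only illustrative. The first variant uses the name argument of add_count(); the second counts distinct categories per word explicitly with n_distinct(), which gives the same result here because each (category, word) pair appears only once after count().

```r
library(dplyr)

# Toy data from the question; count values are taken from the output above
# so the example is reproducible (assumption: the exact values do not matter).
owl <- tibble(category = c(0, 1, 2, -1, 0, 1, 2),
              word = c(rep("hello", 3), rep("world", 4)),
              count = c(98L, 30L, 37L, 22L, 80L, 18L, 19L))

# Variant 1: same as add_count(word), but with a custom column name.
res_rows <- owl %>%
  add_count(word, name = "sum_t")

# Variant 2: count distinct categories per word explicitly.
# This stays correct even if a (category, word) pair occurs on several rows.
res_cats <- owl %>%
  group_by(word) %>%
  mutate(sum_t = n_distinct(category)) %>%
  ungroup()
```

Both variants are vectorised over groups, so they avoid the row-by-row filter() call from the question.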