
Counting keywords per page in a data frame created by keyword_search


library(pdfsearch)
Characters <- c("Ben", "John")
keyword_search('location of file', 
               keyword = Characters,
               path = TRUE)


  keyword page_num
1     Ben        1
2     Ben        1
3    John        1
4    John        2

How can I make R count each keyword on every page_num, producing a data frame like this:

      name   page  count
1      Ben    1      2
2     John    1      1
3     John    2      1

I know I can use nrow() for one keyword and page at a time, but is there a faster way?

nrow(dataframe[dataframe$keyword == "Ben" & dataframe$page_num == 1, ])

Solution

  • Base R supports a wide variety of ways to perform grouped operations (probably too many, as it makes choosing the appropriate method harder):

    my_data <- data.frame(name = c("Ben", "Ben", "John", "John"), page_num = c(1,1,1,2))
    
    > my_data
      name page_num
    1  Ben        1
    2  Ben        1
    3 John        1
    4 John        2
    
    
    # table()
    
    > table(my_data)
          page_num
    name   1 2
      Ben  2 0
      John 1 1
    
    > as.data.frame(table(my_data))
      name page_num Freq
    1  Ben        1    2
    2 John        1    1
    3  Ben        2    0
    4 John        2    1
    
    # xtabs
    
    > xtabs(~ name + page_num, data = my_data)
    
          page_num
    name   1 2
      Ben  2 0
      John 1 1
    
    > as.data.frame(xtabs(~ name + page_num, data = my_data))
      name page_num Freq
    1  Ben        1    2
    2 John        1    1
    3  Ben        2    0
    4 John        2    1
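
    Both conversions above keep the Ben / page 2 row with Freq 0, which the desired output in the question does not contain. A minimal sketch of one way to drop those rows and rename the columns follows; the object names freq and nonzero are illustrative, not part of the original answer:

    ```r
    # Illustrative data mirroring the keyword_search() output in the question
    my_data <- data.frame(name = c("Ben", "Ben", "John", "John"),
                          page_num = c(1, 1, 1, 2))

    # Long-format counts; stringsAsFactors = FALSE keeps the labels as character
    freq <- as.data.frame(table(my_data), stringsAsFactors = FALSE)

    # Keep only the name/page combinations that actually occur
    nonzero <- freq[freq$Freq > 0, ]
    names(nonzero) <- c("name", "page", "count")

    # table() turns page numbers into labels, so convert them back to numbers
    nonzero$page <- as.numeric(nonzero$page)

    # Sort by name, then page, and reset the row names
    nonzero <- nonzero[order(nonzero$name, nonzero$page), ]
    rownames(nonzero) <- NULL
    print(nonzero)
    ```

    The same filter applies unchanged to the xtabs() result, since as.data.frame() returns the same columns for both.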
    

    Other functions for performing grouped operations include by(), tapply(), ave() and more.
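
    As a rough illustration of two further base R options, the sketch below counts rows with aggregate() and with tapply(). The helper column name count and the object names counts and counts_matrix are assumptions chosen for this example:

    ```r
    # Illustrative data mirroring the keyword_search() output in the question
    my_data <- data.frame(name = c("Ben", "Ben", "John", "John"),
                          page_num = c(1, 1, 1, 2))

    # aggregate(): sum a column of ones within each (name, page_num) group.
    # Only combinations present in the data are returned, so no zero rows appear.
    counts <- aggregate(list(count = rep(1L, nrow(my_data))),
                        by = my_data[c("name", "page_num")],
                        FUN = sum)
    counts <- counts[order(counts$name, counts$page_num), ]
    rownames(counts) <- NULL
    print(counts)

    # tapply(): returns a name-by-page matrix instead of a long data frame;
    # combinations that never occur are filled with NA rather than 0
    counts_matrix <- tapply(my_data$name, my_data[c("name", "page_num")], length)
    print(counts_matrix)
    ```

    aggregate() is the closer match to the layout requested in the question, while tapply() is handier when a cross-tabulated matrix is wanted.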

    The popular dplyr package also provides a syntax for grouped operations that works directly on data.frame objects, without first converting them to a table:

    library(dplyr)
    
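    # `group_by()`, `summarise()`, `%>%`, and `n()` are exported by `dplyr`
    my_data %>%
      group_by(name, page_num) %>%
      summarise(count = n(), .groups = "drop")
      # n() returns the number of rows in the current group;
      # summarise() yields one row per group, whereas mutate(count = n())
      # would keep all four rows and repeat the count on each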