Search code examples
rtext-mining

Indicate in df columns occurrence of keywords of another column in R


Assuming a data frame df with several columns, of which a "description" field,

and assuming a set of keywords stored within a separate vector keywords, what is the best practice to:

  • create columns in df for each of these keywords, named accordingly ;
  • and storing the count of its own number of occurrences within the "description" field of df?

e.g.

(df <- data.frame(
  ID = letters[1:10],
  DESCRIPTION = c("blue", "red", "this was green", "this was red", "blue and red", "green", NA, "green", "green, blue, and red", NA)
))
   ID          DESCRIPTION
1   a                 blue
2   b                  red
3   c       this was green
4   d         this was red
5   e         blue and red
6   f                green
7   g                 <NA>
8   h                green
9   i green, blue, and red
10  j                 <NA>
keywords <- c("blue", "red", "green")

would return

   ID          DESCRIPTION blue red green
1   a                 blue    1   0     0
2   b                  red    0   1     0
3   c       this was green    0   0     1
4   d         this was red    0   1     0
5   e         blue and red    1   1     0
6   f                green    0   0     1
7   g                 <NA>    0   0     0
8   h                green    0   0     1
9   i green, blue, and red    1   1     1
10  j                 <NA>    0   0     0

Prefereable using base R or dplyr (eg. avoiding data.table).

Note: the answer needs to be scalable (many possible keywords).


Solution

  • An approach using unnest, str_count and pivot_wider, assuming all occurrences of the keywords within all strings per row have to be counted. Using a slightly modified data set to show multiple counts

    library(dplyr)
    library(tidyr)
    library(stringr)
    
    df %>% 
      mutate(nms = list(!!keywords)) %>% 
      unnest(nms) %>% 
      rowwise() %>% 
      mutate(values = str_count(DESCRIPTION, nms), 
             values = replace(values, is.na(values), 0)) %>% 
      ungroup() %>% 
      pivot_wider(names_from=nms, values_from=values)
    # A tibble: 10 × 5
       ID    DESCRIPTION           blue   red green
       <chr> <chr>                <dbl> <dbl> <dbl>
     1 a     blue blue                2     0     0
     2 b     red                      0     1     0
     3 c     this was green           0     0     1
     4 d     this was red             0     1     0
     5 e     blue and red             1     1     0
     6 f     green                    0     0     1
     7 g     NA                       0     0     0
     8 h     green                    0     0     1
     9 i     green, blue, and red     1     1     1
    10 j     NA                       0     0     0
    

    or with base R sapply and str_count

    library(stringr)
    
    cbind(df, sapply(keywords, function(x){
      res <- str_count(df$DESCRIPTION, x)
      replace(res, is.na(res), 0)}))
       ID          DESCRIPTION blue red green
    1   a            blue blue    2   0     0
    2   b                  red    0   1     0
    3   c       this was green    0   0     1
    4   d         this was red    0   1     0
    5   e         blue and red    1   1     0
    6   f                green    0   0     1
    7   g                 <NA>    0   0     0
    8   h                green    0   0     1
    9   i green, blue, and red    1   1     1
    10  j                 <NA>    0   0     0
    

    mod. data

    df <- structure(list(ID = c("a", "b", "c", "d", "e", "f", "g", "h", 
    "i", "j"), DESCRIPTION = c("blue blue", "red", "this was green", 
    "this was red", "blue and red", "green", NA, "green", "green, blue, and red", 
    NA)), row.names = c(NA, -10L), class = "data.frame")