r, string, string-matching

Matching text tokens with a list of words


I would like to match the words in a list against a text column and extract the matches into a new column.

I have this data

   df <- structure(list(ID = 1:3, Text = c(list("red car, car going, going to"),   list("red ball, ball on, on street"), list("to be, be or, or not"))), class = "data.frame", row.names = c(NA, -3L))


  ID                         Text
1  1 red car, car going, going to
2  2 red ball, ball on, on street
3  3         to be, be or, or not

And I have this list of important words:

words <- c("car", "ball", "street", "dog", "frog")

I would like a df like this:

  ID                         Text  Word
1  1 red car, car going, going to  c("car","car")
2  2 red ball, ball on, on street  c("ball", "ball", "street")
3  3         to be, be or, or not  NA

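My attempt:

library(stringi)  # provides the %s+% concatenation operator
df$Word <- lapply(df$Text, function(x) stringr::str_extract_all(x, "\\b" %s+% words %s+% "\\b"))

But this gives me a list of length 5 for each row (one element per entry in words), not just the words that occur in Text.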


Solution

  • A possible solution:

    library(tidyverse)
    
    df <- data.frame(
      stringsAsFactors = FALSE,
      ID = c(1L, 2L, 3L),
      Text = c("red car, car going, going to","red ball, ball on, on street",
               "to be, be or, or not")
    )
    
    words <- c("car", "ball", "street", "dog", "frog")
    
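    df %>%
      # Work on a copy of Text so the original column stays intact
      mutate(word = Text) %>% 
      # One row per token: split on commas and whitespace
      separate_rows(word, sep = ",|\\s") %>% 
      # Keep only tokens found in `words` (empty strings from ", " drop out too)
      mutate(word = ifelse(word %in% words, word, NA)) %>% 
      drop_na(word) %>% 
      # Collapse the matches back into one string per ID
      group_by(ID) %>% 
      summarise(word = str_c(word, collapse = ", "), .groups = "drop") %>%  
      # Join back to df so rows without any match get NA
      left_join(df,., by=c("ID"))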
    
    #>   ID                         Text               word
    #> 1  1 red car, car going, going to           car, car
    #> 2  2 red ball, ball on, on street ball, ball, street
    #> 3  3         to be, be or, or not               <NA>
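
  • The solution above returns the matches as one comma-separated string per row. If a list column of character vectors is preferred, as in the desired output of the question, a base R variant might look like the sketch below. It assumes Text is a plain character column (if it is a list column as in the question's structure(), unlist(df$Text) first) and that the entries in words contain no regex metacharacters. The names pattern and matches are only illustrative.

```r
# Sample data from the question, with Text as a plain character column.
df <- data.frame(
  ID = 1:3,
  Text = c("red car, car going, going to",
           "red ball, ball on, on street",
           "to be, be or, or not"),
  stringsAsFactors = FALSE
)
words <- c("car", "ball", "street", "dog", "frog")

# Collapse the word list into ONE pattern: \b(car|ball|street|dog|frog)\b
# Passing five separate patterns is what produced a list of length 5.
pattern <- paste0("\\b(", paste(words, collapse = "|"), ")\\b")

# gregexpr() locates every match per row; regmatches() extracts them
# as a list with one character vector per row.
matches <- regmatches(df$Text, gregexpr(pattern, df$Text, perl = TRUE))

# Rows without any match come back as character(0); turn those into NA.
matches[lengths(matches) == 0] <- list(NA_character_)

# Store the result as a list column.
df$Word <- matches
```

    With the sample data, df$Word[[1]] holds "car" twice, df$Word[[2]] holds "ball", "ball", "street", and df$Word[[3]] is NA. The same single-pattern idea should also work with stringr::str_extract_all(df$Text, pattern) in place of the regmatches() call.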