
Replace words from list of words


I have this data frame

df <- structure(list(ID = 1:3, Text = c("there was not clostridium", "clostridium difficile positive", "test was OK but there was clostridium")), class = "data.frame", row.names = c(NA, -3L)) 
 ID                                  Text
1  1             there was not clostridium
2  2        clostridium difficile positive
3  3 test was OK but there was clostridium

And a pattern of stop words:

stop <- paste0(c("was", "but", "there"), collapse = "|")

I would like to go through the Text for each ID and remove the words that match the stop pattern. It is important to keep the order of the words. I do not want to use merge functions.

I have tried this

  df$Words <- tokenizers::tokenize_words(df$Text, lowercase = TRUE) ##I would like to make a list of single words

for (i in length(df$Words)){
  
  df$clean <- lapply(df$Words, function(y) lapply(1:length(df$Words[i]),
                                                 function(x) stringr::str_replace(unlist(y) == x, stop, "REPLACED")))
  
  
}

But this gives me vectors of FALSE values, not a list of words.

> df
  ID                                  Text                                       Words                                           clean
1  1             there was not clostridium                there, was, not, clostridium                      FALSE, FALSE, FALSE, FALSE
2  2        clostridium difficile positive            clostridium, difficile, positive                             FALSE, FALSE, FALSE
3  3 test was OK but there was clostridium test, was, ok, but, there, was, clostridium FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE

I would like to get this (replace all words that match the stop pattern and keep the word order):

> df
  ID                                  Text                                       Words                                           clean
1  1             there was not clostridium                there, was, not, clostridium                      "REPLACED", "REPLACED", not, clostridium
2  2        clostridium difficile positive            clostridium, difficile, positive                             clostridium, difficile, positive
3  3 test was OK but there was clostridium test, was, ok, but, there, was, clostridium test, "REPLACED", OK, "REPLACED", "REPLACED", "REPLACED", clostridium

Solution

  • Tidyverse solution:

    First, you need to modify the stop vector so it contains \b before and after each stop word. \b matches a word boundary, which avoids accidentally removing the pattern from within longer words.

    library(stringr)
    library(dplyr)
    
    # Wrap each stop word in word boundaries so only whole words match
    stop <- paste0(c("\\bwas\\b", "\\bbut\\b", "\\bthere\\b"), collapse = "|")
    

    Then remove the stop words with str_remove_all. This leaves extra whitespace where the words used to be (including leading spaces and runs longer than two when consecutive words are removed), which str_squish cleans up by trimming both ends and collapsing repeated whitespace into a single space.

    df %>% mutate(Words = str_remove_all(Text, stop)) %>%
           mutate(Words = str_squish(Words))
    

    This yields the following result (I added the row "I was bit by a wasp" to check that "wasp" is left intact):

    # A tibble: 4 x 3
         ID Text                                  Words                         
      <int> <chr>                                 <chr>                         
    1     1 there was not clostridium             not clostridium               
    2     2 clostridium difficile positive        clostridium difficile positive
    3     3 test was OK but there was clostridium test OK clostridium           
    4     4 I was bit by a wasp                   I bit by a wasp
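
The answer above removes the stop words from the Text string. If you want what the question asked for, a list of words with each stop word swapped for "REPLACED", one possible base R sketch is below. It is not part of the original answer. The helper name replace_stop_words is hypothetical, and splitting the lowercased text on whitespace is an assumption standing in for tokenizers::tokenize_words, which also handles punctuation.

```r
# Data from the question
df <- data.frame(
  ID = 1:3,
  Text = c("there was not clostridium",
           "clostridium difficile positive",
           "test was OK but there was clostridium"),
  stringsAsFactors = FALSE
)

# Stop words as a plain character vector (no regex needed here)
stop_words <- c("was", "but", "there")

# Hypothetical helper: swap every token that is exactly a stop word
# for the replacement string, leaving order and other tokens untouched
replace_stop_words <- function(words, stop_words, replacement = "REPLACED") {
  ifelse(words %in% stop_words, replacement, words)
}

# Assumption: a whitespace split of the lowercased text stands in for
# tokenizers::tokenize_words(df$Text, lowercase = TRUE)
df$Words <- strsplit(tolower(df$Text), "\\s+")

# One character vector per row, same length and order as the tokens
df$clean <- lapply(df$Words, replace_stop_words, stop_words = stop_words)

print(df$clean)
```

Because %in% compares whole tokens, a word such as "wasp" is never touched, so no \b anchors are needed in this variant. Note that the tokens are lowercased, so "OK" comes back as "ok".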