How to find rows that contain any word from a given list of words (not just one specific word)?


I have a given list of words, for example:

words <- c("breast","cancer","chemotherapy")

I also have a very large data frame with one variable and more than 10,000 rows.

I would like to select all the rows that contain any word in "words". A match on any single word counts, and so does a match on several of them.

If I knew in advance what the "words" would be, I could run stringr extractions one word at a time. However, the "words" change every time and are not known in advance. Is there a direct way to do this?

Additionally, is it possible to select all rows that contain two or more words from "words"? E.g., a row containing only "cancer" does not count, but a row containing both "breast" and "cancer" does. Again, the "words" change every time and are not known in advance. Is there a direct way?


Solution

  • Some fake data:

    words <- c("breast","cancer","chemotherapy")
    df <- data.frame(v1 = c("there was nothing found","the chemotherapy is effective","no cancer no chemotherapy","the breast looked normal","something"))
    

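    You could use a combination of grepl, sapply and rowSums. sapply applies grepl once per word and returns a logical matrix with one row per entry of df$v1 and one column per word; rowSums then counts how many words each row matches: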

    df[rowSums(sapply(words, grepl, df$v1)) > 0, , drop = FALSE]
    

    This results in:

                                 v1
    2 the chemotherapy is effective
    3     no cancer no chemotherapy
    4      the breast looked normal
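
    When only "any word" matters and no count is needed, a single regular expression may be enough. The sketch below is an alternative, not part of the approach above; it assumes the entries of words contain no regex metacharacters (such as "." or "("), and the names pattern and any_match are made up for illustration:

    ```r
    words <- c("breast", "cancer", "chemotherapy")
    df <- data.frame(v1 = c("there was nothing found",
                            "the chemotherapy is effective",
                            "no cancer no chemotherapy",
                            "the breast looked normal",
                            "something"))

    # Join the words into one alternation: "breast|cancer|chemotherapy"
    # (assumption: no word contains regex metacharacters)
    pattern <- paste(words, collapse = "|")

    # A single grepl call over the column; TRUE where any word occurs
    any_match <- df[grepl(pattern, df$v1), , drop = FALSE]
    any_match
    ```

    This should select the same rows (2, 3 and 4) as the rowSums version with > 0, but it cannot tell how many different words a row contains.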
    

    If you want to select only the rows that contain at least two of the words, then:

    df[rowSums(sapply(words, grepl, df$v1)) > 1, , drop = FALSE]
    

    The result:

                                 v1
    3     no cancer no chemotherapy
    

    NOTE: you need drop = FALSE because your data frame has only one variable (column); without it, R simplifies the result to a plain vector instead of a data frame. If your data frame has more than one column, drop = FALSE is not needed.
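
    Two caveats may be worth checking against real data. grepl matches substrings, so "cancer" also matches "cancerous". And sapply returns a plain vector rather than a matrix when df has a single row, which makes rowSums fail. The sketch below is one possible way around both, under the assumption that whole-word matches are what is wanted; count_word_hits and df2 are hypothetical names used only for illustration:

    ```r
    words <- c("breast", "cancer", "chemotherapy")
    df2 <- data.frame(v1 = c("breastfeeding advice",
                             "breast cancer screening",
                             "no cancer found",
                             "cancerous cells and chemotherapy"))

    # Hypothetical helper: how many of the list words occur in each string
    count_word_hits <- function(text, words) {
      # \\b marks a word boundary, so "cancer" does not match "cancerous"
      # (assumption: the words contain no regex metacharacters)
      patterns <- paste0("\\b", words, "\\b")
      # One logical vector per word, each as long as `text`
      hits <- lapply(patterns, grepl, x = text, perl = TRUE)
      # Add the vectors element-wise; the zero init also covers an empty `words`
      Reduce(`+`, hits, integer(length(text)))
    }

    counts <- count_word_hits(df2$v1, words)

    # Rows containing at least two different words from the list
    df2[counts >= 2, , drop = FALSE]
    ```

    Because the counts are built by adding vectors rather than by simplifying to a matrix, the same call works for a data frame with one row. Changing the threshold to counts >= 1 gives the "any word" selection.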