
How to match patterns in a column of longer strings using dplyr


I am trying to match all of my target tissues ("heart", "muscle", "kidney", "liver") against the Tissue column of the data frame pasted below, and to list the names of the species that have every one of those tissues.

Data:

df <- read.csv(text =
"Species,Tissue
Human,Kr_liver_2
Human,Heart
Human,Liver_556
Human,Kr_Kidney_2
Human,Kr_Muscle_2
Human,Kr_Brain_2
Mouse,Brain
Mouse,Kr_liver_3
Mouse,Kr_liver_5
Mouse,Kr_liver_27")

I tried the approach below but got an empty result. Based on the data frame above, the desired output is 'Human', because it has all of the targeted tissues.

library(dplyr)

target_tissues <- c("heart", "muscle", "kidney", "liver")

Tissue_check <- df %>%
  group_by(Species) %>%
  filter(all(grepl(paste(target_tissues, collapse = "|"), tolower(Tissue)))) %>%
  pull(Species) %>%
  unique()

How can I achieve this?


Solution

  • You can paste all elements of the Tissue column into one string per species and check whether every target tissue appears in that string.

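    library(dplyr)  # the `.by` argument requires dplyr >= 1.1.0
    
    target <- c("heart", "muscle", "kidney", "liver")
    
    # For each species, collapse its tissues into one string and keep that
    # species' rows only if every target matches the string (case-insensitive).
    df %>%
      filter(all(sapply(target, grepl, toString(Tissue), ignore.case = TRUE)),
             .by = Species)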
    

    An alternative with stringr:

    library(stringr)
    
    # fixed() treats each target as a literal string rather than a regex.
    df %>%
      filter(all(str_detect(toString(Tissue), fixed(target, ignore_case = TRUE))),
             .by = Species)
    
    Output
    #   Species      Tissue
    # 1   Human  Kr_liver_2
    # 2   Human       Heart
    # 3   Human   Liver_556
    # 4   Human Kr_Kidney_2
    # 5   Human Kr_Muscle_2
    # 6   Human  Kr_Brain_2
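
The filter above keeps every row of the matching species, whereas the question asks only for the species names. The attempt in the question returns nothing because `all(grepl("heart|muscle|kidney|liver", ...))` tests whether every row matches at least one target, and each species has a Brain row that matches none. A minimal sketch of one way to return just the names follows. It assumes the same `df` and `target` as above, and the helper `has_all_targets()` is a hypothetical name introduced here for illustration. It uses `group_by()` and `summarise()` instead of `.by`, so it should also run on dplyr versions older than 1.1.0.

```r
library(dplyr)

df <- read.csv(text =
"Species,Tissue
Human,Kr_liver_2
Human,Heart
Human,Liver_556
Human,Kr_Kidney_2
Human,Kr_Muscle_2
Human,Kr_Brain_2
Mouse,Brain
Mouse,Kr_liver_3
Mouse,Kr_liver_5
Mouse,Kr_liver_27", stringsAsFactors = FALSE)

target <- c("heart", "muscle", "kidney", "liver")

# Hypothetical helper: TRUE only if every target occurs somewhere in the
# group's tissue labels (case-insensitive substring match).
has_all_targets <- function(tissues, targets) {
  combined <- toString(tissues)  # all labels of the group in one string
  all(vapply(targets,
             function(p) grepl(p, combined, ignore.case = TRUE),
             logical(1)))
}

# One logical flag per species, then keep the names of the species that pass.
species_with_all <- df %>%
  group_by(Species) %>%
  summarise(has_all = has_all_targets(Tissue, target)) %>%
  filter(has_all) %>%
  pull(Species)

species_with_all
```

Because the match is a plain substring test, a label that merely contains a target word would also count, so stricter patterns may be needed for messier data.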