Subsetting a data frame containing factors, NA values, and wildcards

So I have a large data frame with several different categories, a simplified example is below (The true dataset has 10+ different Tissues, 15+ different unique celltypes with variable length names per tissue, and thousands of genes). The Tissue columns are formatted as factors.

GENENAME    Tissue1     Tissue2     Tissue3
Gene1       CellType_AA CellType_BB CellType_G
Gene2       CellType_AA CellType_BB       <NA>
Gene3       CellType_AA       <NA>        <NA>
Gene4       CellType_AA CellType_BB CellType_G
Gene5             <NA>        <NA>  CellType_G
Gene6             <NA>  CellType_BB CellType_H
Gene7       CellType_AC CellType_BD CellType_H
Gene8             <NA>        <NA>  CellType_H
Gene9       CellType_AC CellType_BD       <NA>
Gene10            <NA>  CellType_BB       <NA>
Gene11            <NA>  CellType_BD CellType_H
Gene12      CellType_AC       <NA>        <NA>
Gene13            <NA>  CellType_E  CellType_I
Gene14      CellType_F  CellType_E  CellType_I
Gene15      CellType_F  CellType_E        <NA>

What I am trying to do is return a subset based on CellTypes present in multiple tissues, and ignore unnecessary columns when I do so. Additionally, I want to use wildcards (in the the example below, CellType_A*, in order to pick up both CellType_AA and CellType_AB), and ignore the other columns when I only specify some of the columns. I want the function to be easily reusable for different combinations of celltypes, so added a seperate variable for each column.

To do this I set up the function below, setting the default value of each variable as "*", thinking that then it would treat any of those columns as valid if I don't specify an input.

Find_CoEnrich <- function(T1="*", T2="*", T3="*"){
  subset(dataset, 
         grepl(T1, dataset$Tissue1)
         &grepl(T2, dataset$Tissue2)
         &grepl(T3, dataset$Tissue3)
         ,select = GENENAME
  )  
}

However when I run the function on only a single column, to test it

Find_CoEnrich(T1="CellType_AA")

It will return only the following:

   GENENAME
1     Gene1
4     Gene4

instead of

1     Gene1
2     Gene2
3     Gene3
4     Gene4

Skipping any rows which contain an NA in another column. Even more mysteriously, if I try with the wildcard, it seemingly ignores the rest of the string and just returns only those rows which have values in every row, even if they don't match the rest of the string, sich as Gene14:

Find_CoEnrich(T1="CellType_A*")

   GENENAME
1     Gene1
4     Gene4
7     Gene7
14   Gene14

I am pretty sure it is the presence of the NA's in the table that is causing problems, but have spent a long time trying to correct this and am running out of patience. If anyone can help it would be much appreciated.

Solution

The wildcard character * you intend to use has a specific meaning as a regular expression, which is how you tell grepl which values to accept - it means 0 or more repetitions of the preceding character. Also, I believe you want a boolean OR (|) operation between the grepl expressions, since you want any row where one of the columns matches the pattern.

Here's a perhaps simpler solution using tidyverse, using separate 'row-based filtering' and 'column selection' steps:

library(tidyverse)

dataset <-  # small subset of your data, rows 1-4 should match but not 5
  tribble(
    ~GENENAME,    ~Tissue1,     ~Tissue2,     ~Tissue3,
    "Gene1", "CellType_AA", "CellType_BB", "CellType_G",
    "Gene2", "CellType_AA", "CellType_BB", NA,
    "Gene3", "CellType_AA", NA, NA,
    "Gene4", "CellType_AA", "CellType_BB", "CellType_G",
    "Gene5", NA, NA, "CellType_G"
    )

desired_pattern <- "CellType_A"  # note that this already implies that any other character can follow, e.g. this will match CellType_AA, CellType_AB, etc.

dataset %>%
  select(all_of(c("GENENAME","Tissue1","Tissue2","Tissue3"))) %>%  # the column selection
  filter(if_any(  # this is a tad confusing: return the row if any of the specified columns matches the condition...
    .cols = all_of(c("Tissue1", "Tissue2", "Tissue3")),  # specify which columns to check
    .fns = ~ stringr::str_detect(.x, pattern = desired_pattern)  # specify the condition...str_detect() is basically grepl() under the hood
  ))

To change to matched cell types beginning with A or B, you could change the pattern accordingly:

desired_pattern  <- ""  # this will match any cell type that starts with A or B

EDIT:

To find rows that match BOTH CellType_A in one of the columns and CellType_B in another, you can do two successive filter steps:

dataset %>%
  select(all_of(c("GENENAME","Tissue1","Tissue2","Tissue3"))) %>%  # the column selection
  filter(if_any(  # in this step, keep only rows that contain at least one `CellType_A`
    .cols = all_of(c("Tissue1", "Tissue2", "Tissue3")),  # specify which columns to check
    .fns = ~ stringr::str_detect(.x, pattern = "CellType_A")
  )) %>%
  filter(if_any(  # in this step, keep only rows that contain at least one `CellType_B`
    .cols = all_of(c("Tissue1", "Tissue2", "Tissue3")),  # specify which columns to check
    .fns = ~ stringr::str_detect(.x, pattern = "CellType_B")
  ))

The order of the two filtering steps above doesn't matter (and you can try swapping them round to convince yourself!)