I have a large data frame with several categorical columns; a simplified example is below. (The true dataset has 10+ tissues, 15+ unique cell types per tissue with variable-length names, and thousands of genes.) The Tissue columns are stored as factors.
GENENAME Tissue1 Tissue2 Tissue3
Gene1 CellType_AA CellType_BB CellType_G
Gene2 CellType_AA CellType_BB <NA>
Gene3 CellType_AA <NA> <NA>
Gene4 CellType_AA CellType_BB CellType_G
Gene5 <NA> <NA> CellType_G
Gene6 <NA> CellType_BB CellType_H
Gene7 CellType_AC CellType_BD CellType_H
Gene8 <NA> <NA> CellType_H
Gene9 CellType_AC CellType_BD <NA>
Gene10 <NA> CellType_BB <NA>
Gene11 <NA> CellType_BD CellType_H
Gene12 CellType_AC <NA> <NA>
Gene13 <NA> CellType_E CellType_I
Gene14 CellType_F CellType_E CellType_I
Gene15 CellType_F CellType_E <NA>
What I am trying to do is return a subset based on cell types present in multiple tissues, ignoring unnecessary columns when I do so. Additionally, I want to use wildcards (in the example below, CellType_A*, in order to pick up both CellType_AA and CellType_AB), and ignore the other columns when I only specify some of the columns. I want the function to be easily reusable for different combinations of cell types, so I added a separate variable for each column.
To do this I set up the function below, setting the default value of each variable to "*", thinking that it would then treat any value in those columns as valid if I don't specify an input.
Find_CoEnrich <- function(T1 = "*", T2 = "*", T3 = "*") {
  subset(dataset,
    grepl(T1, dataset$Tissue1)
    & grepl(T2, dataset$Tissue2)
    & grepl(T3, dataset$Tissue3),
    select = GENENAME
  )
}
However, when I run the function on only a single column to test it:
Find_CoEnrich(T1="CellType_AA")
It will return only the following:
GENENAME
1 Gene1
4 Gene4
instead of
1 Gene1
2 Gene2
3 Gene3
4 Gene4
It skips any row that contains an NA in another column. Even more mysteriously, if I try with the wildcard, it seemingly ignores the rest of the string and returns only those rows which have values in every column, even if they don't match the rest of the string, such as Gene14:
Find_CoEnrich(T1="CellType_A*")
GENENAME
1 Gene1
4 Gene4
7 Gene7
14 Gene14
I am pretty sure it is the presence of the NAs in the table that is causing the problems, but I have spent a long time trying to correct this and am running out of patience. If anyone can help, it would be much appreciated.
The wildcard character * you intend to use has a specific meaning in a regular expression, which is how you tell grepl which values to accept: it means 0 or more repetitions of the preceding character. The pattern CellType_A* therefore matches "CellType_" followed by any number of As (including none), which is why CellType_F in Gene14 was accepted. Also, I believe you want a boolean OR (|) operation between the grepl expressions, since you want any row where one of the columns matches the pattern.
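To make the two surprises in the question concrete, here is a minimal base R sketch of how grepl treats the * quantifier and missing values. The vector tissue1 is a made-up stand-in for one Tissue column, and ".*" is used as an explicit "match anything" pattern, which is an assumption about what the "*" default was meant to express.

```r
# Hypothetical stand-in for one Tissue column, including a missing value
tissue1 <- c("CellType_AA", "CellType_AC", "CellType_F", NA)

# ".*" matches any string, but grepl() still returns FALSE for NA,
# so a "match anything" default silently drops rows with missing values
match_anything <- grepl(".*", tissue1)

# "A*" means zero or more "A", so "CellType_F" is accepted as well
with_star <- grepl("CellType_A*", tissue1)

# Without the "*", only values containing "CellType_A" are accepted
without_star <- grepl("CellType_A", tissue1)

print(data.frame(tissue1, match_anything, with_star, without_star))
```

Combining such results with & across all three columns, as the original function does, means a single NA in any column is enough to exclude the row.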
Here's a perhaps simpler solution using the tidyverse, with separate 'row-based filtering' and 'column selection' steps:
library(tidyverse)

# Small subset of your data: rows 1-4 should match, but not row 5
dataset <- tribble(
  ~GENENAME, ~Tissue1,      ~Tissue2,      ~Tissue3,
  "Gene1",   "CellType_AA", "CellType_BB", "CellType_G",
  "Gene2",   "CellType_AA", "CellType_BB", NA,
  "Gene3",   "CellType_AA", NA,            NA,
  "Gene4",   "CellType_AA", "CellType_BB", "CellType_G",
  "Gene5",   NA,            NA,            "CellType_G"
)

# Note that this already implies that any other character can follow,
# e.g. this will match CellType_AA, CellType_AB, etc.
desired_pattern <- "CellType_A"

dataset %>%
  select(all_of(c("GENENAME", "Tissue1", "Tissue2", "Tissue3"))) %>% # the column selection
  # This is a tad confusing: return the row if any of the specified columns matches the condition
  filter(if_any(
    .cols = all_of(c("Tissue1", "Tissue2", "Tissue3")), # specify which columns to check
    .fns = ~ stringr::str_detect(.x, pattern = desired_pattern) # the condition; str_detect() plays the same role as grepl()
  ))
To match cell types beginning with A or B instead, you could change the pattern accordingly:
desired_pattern <- "CellType_[AB]" # this will match any cell type that starts with A or B
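One way to express "A or B" in a regular expression is a character class. The short base R check below is only a sketch: the cell_types vector is made up for illustration, and the leading ^ is an optional addition that pins the match to the start of the string.

```r
# Hypothetical values standing in for entries of the Tissue columns
cell_types <- c("CellType_AA", "CellType_BD", "CellType_G", NA)

# [AB] matches exactly one character that is either "A" or "B";
# "^" (optional) requires the match to begin at the start of the string
ab_pattern <- "^CellType_[AB]"

matches_ab <- grepl(ab_pattern, cell_types)
print(setNames(matches_ab, cell_types))
```

The same pattern string can be passed to stringr::str_detect() in the filter step.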
EDIT:
To find rows that match BOTH CellType_A in one of the columns and CellType_B in another, you can do two successive filter steps:
dataset %>%
  select(all_of(c("GENENAME", "Tissue1", "Tissue2", "Tissue3"))) %>% # the column selection
  filter(if_any( # in this step, keep only rows that contain at least one `CellType_A`
    .cols = all_of(c("Tissue1", "Tissue2", "Tissue3")), # specify which columns to check
    .fns = ~ stringr::str_detect(.x, pattern = "CellType_A")
  )) %>%
  filter(if_any( # in this step, keep only rows that contain at least one `CellType_B`
    .cols = all_of(c("Tissue1", "Tissue2", "Tissue3")), # specify which columns to check
    .fns = ~ stringr::str_detect(.x, pattern = "CellType_B")
  ))
The order of the two filtering steps above doesn't matter (and you can try swapping them round to convince yourself!).
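If the goal is closer to the original function, i.e. a reusable helper where each pattern applies to one specific tissue column and unspecified columns are ignored entirely, a base R sketch could look like the following. This is a different technique from the if_any approach above: it combines the named columns with AND, as the question's function does. The name find_coenrich, the patterns argument, and the six-row toy data are all assumptions made for illustration.

```r
# Toy data: a few rows taken from the question's example, with factor columns
dataset <- data.frame(
  GENENAME = c("Gene1", "Gene2", "Gene3", "Gene5", "Gene7", "Gene14"),
  Tissue1  = factor(c("CellType_AA", "CellType_AA", "CellType_AA", NA, "CellType_AC", "CellType_F")),
  Tissue2  = factor(c("CellType_BB", "CellType_BB", NA, NA, "CellType_BD", "CellType_E")),
  Tissue3  = factor(c("CellType_G", NA, NA, "CellType_G", "CellType_H", "CellType_I"))
)

# Hypothetical helper: `patterns` is a named list mapping column name -> regex.
# Only the named columns are tested, so NAs in other columns cannot drop a row.
find_coenrich <- function(data, patterns) {
  keep <- rep(TRUE, nrow(data))
  for (col in names(patterns)) {
    # grepl() returns FALSE for NA, so a row must have a matching value here
    keep <- keep & grepl(patterns[[col]], data[[col]])
  }
  data[keep, "GENENAME", drop = FALSE]
}

# One column specified: NAs in Tissue2 and Tissue3 no longer matter
only_aa <- find_coenrich(dataset, list(Tissue1 = "CellType_AA"))

# Two columns specified: both conditions must hold in the same row
a_and_b <- find_coenrich(dataset, list(Tissue1 = "CellType_A", Tissue2 = "CellType_B"))

print(only_aa)
print(a_and_b)
```

Passing the patterns as a named list avoids adding one function argument per tissue, which may be convenient with 10+ tissue columns.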