How can I add one logical column per module, containing TRUE if the gene is in the module and FALSE if it is not?
gene_express = data.frame(
  gene = c('gene1', 'gene2', 'gene3', 'gene4', 'gene5',
           'gene6', 'gene7', 'gene8', 'gene9', 'gene10'),
  sample1 = sample(0:10, 10),
  sample2 = sample(0:10, 10),
  sample3 = sample(0:10, 10),
  sample4 = sample(0:10, 10))
module1 = c('gene1', 'gene2', 'gene10', 'gene8')
module2 = c('gene2', 'gene9', 'gene6', 'gene5', 'gene10')
module3 = c('gene4', 'gene10', 'gene1', 'gene8')
module4 = c('gene5', 'gene8', 'gene2', 'gene7', 'gene6', 'gene5', 'gene10')
module5 = c('gene2', 'gene9', 'gene6', 'gene5', 'gene10')
module6 = c('gene4', 'gene10', 'gene1', 'gene8')
Module_list = list(module1, module2, module3, module4, module5, module6)
names(Module_list) <- c('module1', 'module2', 'module3',
'module4', 'module5', 'module6')
In reality, I have hundreds of these modules, stored in a named list of character vectors, just like my example 'Module_list'. How can I add columns to the 'gene_express' data frame, one per module name, containing TRUE if the gene is in the module and FALSE if not?
The manual way is to spell out each module's genes inside mutate(), as I have here:
library(dplyr)
# One case_match() per module; the gene vectors mirror Module_list above
gene_express %>% mutate(
module1 = case_match(gene, c("gene1", "gene2", "gene10", "gene8") ~ TRUE, .default = FALSE),
module2 = case_match(gene, c("gene2", "gene9", "gene6", "gene5", "gene10") ~ TRUE, .default = FALSE),
module3 = case_match(gene, c("gene4", "gene10", "gene1", "gene8") ~ TRUE, .default = FALSE),
module4 = case_match(gene, c("gene5", "gene8", "gene2", "gene7", "gene6", "gene10") ~ TRUE, .default = FALSE),
module5 = case_match(gene, c("gene2", "gene9", "gene6", "gene5", "gene10") ~ TRUE, .default = FALSE),
module6 = case_match(gene, c("gene4", "gene10", "gene1", "gene8") ~ TRUE, .default = FALSE))
What I want is to avoid manually specifying the module in mutate.
Maybe something like this? Put the genes for each module into a long data frame (one row per module/gene pair), left-join it to the original data, then pivot wider and fill the combinations that did not join with FALSE.
library(tidyverse)
Module_df <- Module_list |>
map_dfr(as.data.frame, .id = "module") |> # function from purrr
rename(gene = 2)
gene_express |>
left_join(Module_df |> mutate(val = TRUE), by = "gene") |>
pivot_wider(names_from = module, values_from = val, # function from tidyr
values_fn = first, values_fill = FALSE) |> # first() collapses the duplicated gene5 in module4
select(-any_of("NA")) # drop the stray `NA` column created by genes in no module (gene3)
Result
# A tibble: 10 × 11
   gene   sample1 sample2 sample3 sample4 module1 module3 module6 module2 module4 module5
   <chr>    <int>   <int>   <int>   <int> <lgl>   <lgl>   <lgl>   <lgl>   <lgl>   <lgl>
 1 gene1       10       0       3       4 TRUE    TRUE    TRUE    FALSE   FALSE   FALSE
 2 gene2        5       8       5       5 TRUE    FALSE   FALSE   TRUE    TRUE    TRUE
 3 gene3        8       9       7       2 FALSE   FALSE   FALSE   FALSE   FALSE   FALSE
 4 gene4        1       5       9       0 FALSE   TRUE    TRUE    FALSE   FALSE   FALSE
 5 gene5        4       4       8       3 FALSE   FALSE   FALSE   TRUE    TRUE    TRUE
 6 gene6        6      10       0       9 FALSE   FALSE   FALSE   TRUE    TRUE    TRUE
 7 gene7        3       1       1       7 FALSE   FALSE   FALSE   FALSE   TRUE    FALSE
 8 gene8        2       3       6       6 TRUE    TRUE    TRUE    FALSE   TRUE    FALSE
 9 gene9        0       2       4       1 FALSE   FALSE   FALSE   TRUE    FALSE   TRUE
10 gene10       7       6       2      10 TRUE    TRUE    TRUE    TRUE    TRUE    TRUE
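
An alternative that skips the join and reshape entirely is to test membership directly with `%in%`, once per module. This is only a minimal sketch, assuming the same shape of `gene_express` and `Module_list` as in the question; the trimmed toy data, the `set.seed()` call and the `membership` name are my own additions to keep the example self-contained.

```r
# Toy data mirroring the question (trimmed; set.seed() is an assumption added for reproducibility)
set.seed(1)
gene_express <- data.frame(
  gene    = paste0("gene", 1:10),
  sample1 = sample(0:10, 10)
)
Module_list <- list(
  module1 = c("gene1", "gene2", "gene10", "gene8"),
  module2 = c("gene2", "gene9", "gene6", "gene5", "gene10"),
  module4 = c("gene5", "gene8", "gene2", "gene7", "gene6", "gene5", "gene10")
)

# One logical vector per module: is each gene a member of that module?
# `membership` is a hypothetical helper name, not something from the question.
membership <- lapply(Module_list, function(m) gene_express$gene %in% m)

# Assigning a named list adds one column per module, in list order
gene_express[names(membership)] <- membership

gene_express
```

Because `%in%` always returns TRUE or FALSE, a gene that belongs to no module simply gets FALSE everywhere, and a gene listed twice in a module needs no special handling. The columns also come out in the order of the list rather than in order of first match.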