I have a large dataframe of transcript IDs and their associated gene names, as well as a column that indicates if a certain model reported that gene/transcript as significant. Every transcript will always be linked to a single gene, although each gene is associated with multiple transcripts. Here's a reproducible example:
transcript <- c('t1', 't2', 't3', 't4', 't5', 't6', 't7','t8', 't9', 't10')
gene <- c('g1', 'g1', 'g1', 'g2', 'g2', 'g2', 'g3','g3', 'g3', 'g3')
model1 <- c('TRUE', 'TRUE', 'TRUE', 'FALSE', 'FALSE', 'FALSE', 'TRUE', 'TRUE', 'TRUE', 'TRUE')
model2 <-c('FALSE', 'FALSE', 'FALSE', 'TRUE', 'TRUE', 'TRUE', 'TRUE', 'TRUE', 'TRUE', 'TRUE')
shared <- data.frame(transcript, gene, model1, model2)
shared
transcript gene model1 model2
<chr> <chr> <chr> <chr>
t1 g1 TRUE FALSE
t2 g1 TRUE FALSE
t3 g1 TRUE FALSE
t4 g2 FALSE TRUE
t5 g2 FALSE TRUE
t6 g2 FALSE TRUE
t7 g3 TRUE TRUE
t8 g3 TRUE TRUE
t9 g3 TRUE TRUE
t10 g3 TRUE TRUE
I'd like to condense and summarize this df by the number and identity of TRUE values into something like this:
n_transcripts n_genes transcripts genes
<dbl> <dbl> <chr> <chr>
model1 7 2 t1;t2;t3;t7;t8;t9;t10 g1;g3
model2 7 2 t4;t5;t6;t7;t8;t9;t10 g2;g3
I've tried creating a blank df and summarizing from the initial df, but keep getting held up when it comes to distinguishing between model1 and model2. In reality, I have dozens of models and would like to avoid going through each model manually. Does anyone have ideas about where to start with this?
Pivot to longer format on model columns and filter only those rows that have value TRUE. Group by model and summarise wanted values.
shared %>%
pivot_longer(
cols = matches("^model[0-9]+$"),
names_to = "model"
) %>%
filter(value == TRUE) %>%
group_by(model) %>%
summarise(
n_transcripts = n(),
n_genes = length(unique(gene)),
transcripts = str_c(unique(transcript), collapse = ";"),
genes = str_c(unique(gene), collapse = ";"),
)