Search code examples
rdplyrsummary

Tidy way to summarize several columns of data in R?


I have a large dataframe of transcript IDs and their associated gene names, as well as a column that indicates if a certain model reported that gene/transcript as significant. Every transcript will always be linked to a single gene, although each gene is associated with multiple transcripts. Here's a reproducible example:

transcript <- c('t1', 't2', 't3', 't4', 't5', 't6', 't7','t8', 't9', 't10')
gene <- c('g1', 'g1', 'g1', 'g2', 'g2', 'g2', 'g3','g3', 'g3', 'g3')
model1 <- c('TRUE', 'TRUE', 'TRUE', 'FALSE', 'FALSE', 'FALSE', 'TRUE', 'TRUE', 'TRUE', 'TRUE')
model2 <-c('FALSE', 'FALSE', 'FALSE', 'TRUE', 'TRUE', 'TRUE', 'TRUE', 'TRUE', 'TRUE', 'TRUE')
shared <- data.frame(transcript, gene, model1, model2)
shared

transcript gene model1 model2
<chr> <chr> <chr> <chr>
t1  g1  TRUE    FALSE   
t2  g1  TRUE    FALSE   
t3  g1  TRUE    FALSE   
t4  g2  FALSE   TRUE    
t5  g2  FALSE   TRUE    
t6  g2  FALSE   TRUE    
t7  g3  TRUE    TRUE    
t8  g3  TRUE    TRUE    
t9  g3  TRUE    TRUE    
t10 g3  TRUE    TRUE

I'd like to condense and summarize this df by the number and identity of TRUE values into something like this:

n_transcripts n_genes transcripts genes
<dbl> <dbl> <chr> <chr>
model1  7   2   t1;t2;t3;t7;t8;t9;t10   g1;g3
model2  7   2   t4;t5;t6;t7;t8;t9;t10   g2;g3

I've tried creating a blank df and summarizing from the initial df, but keep getting held up when it comes to distinguishing between model1 and model2. In reality, I have dozens of models and would like to avoid going through each model manually. Does anyone have ideas about where to start with this?


Solution

  • Pivot to longer format on model columns and filter only those rows that have value TRUE. Group by model and summarise wanted values.

    shared %>%
      pivot_longer(
        cols = matches("^model[0-9]+$"),
        names_to = "model"
      ) %>%
      filter(value == TRUE) %>%
      group_by(model) %>%
      summarise(
        n_transcripts = n(),
        n_genes = length(unique(gene)),
        transcripts = str_c(unique(transcript), collapse = ";"),
        genes = str_c(unique(gene), collapse = ";"),
      )