R - maximum value of variables when compared between levels of variable1 grouped by variable2


Consider the following data

set.seed(123)

example.df <- data.frame(
  gene = sample(c("A", "B", "C", "D"), 100, replace = TRUE),
  treated = sample(c("Yes", "No"), 100, replace = TRUE),
  resp = rnorm(100, 10, 5),
  effect = rnorm(100, 25, 5)
)

I am trying to get the maximum value of every numeric variable for each pairwise comparison of gene levels, grouped by treated. I can create the gene combinations like so:

combn(sort(unique(example.df$gene)), 2, simplify = TRUE)

#     [,1] [,2] [,3] [,4] [,5] [,6]
#[1,] A    A    A    B    B    C
#[2,] B    C    D    C    D    D
#Levels: A B C D

Edit: The output I am looking for is a data frame like this:

comparison   group    max.resp    max.effect
A-B          No       value1      value2
....
C-D          No       valueX      valueY
A-B          Yes      value3      value4
....
C-D          Yes      valueXX     valueYY

I am able to get the max values for each individual gene level, grouped by treated:

library(tidyverse)

max.df <- example.df %>% 
           group_by(treated, gene) %>% 
           nest() %>% 
           mutate(mod = map(data, ~summarise_if(.x, is.numeric, max, na.rm = TRUE))) %>% 
           select(treated, gene, mod) %>% 
           unnest(mod) %>% 
           arrange(treated, gene)

However, despite working on this for more than a day, I cannot figure out how to get the max of each numeric variable for each pairwise gene comparison (A vs B, A vs C, A vs D, B vs C, B vs D, and C vs D), grouped by treated.

Any help is appreciated. Thanks.


Solution

  • I found a solution. It might be a little messy, and I may tidy it up later, but it runs almost instantly.

    library(tidyverse)
    

    First I generate a data frame with two columns, Gen1 and Gen2, holding all possible comparisons. This is very similar to your use of combn, but it creates a data.frame:

    GeneComp <- expand.grid(Gen1 = unique(example.df$gene), Gen2 = unique(example.df$gene)) %>% filter(Gen1 != Gen2) %>% arrange(Gen1)
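
    One thing worth noting about this step: expand.grid builds ordered pairs, so after the filter each comparison shows up in both directions (A-B and B-A). The base R sketch below, on a plain vector of gene labels, just illustrates that difference from combn. The names ordered, unordered and labels are made up for the illustration.

    ```r
    genes <- c("A", "B", "C", "D")

    # Ordered pairs: both (A, B) and (B, A) survive the Gen1 != Gen2 filter
    ordered <- expand.grid(Gen1 = genes, Gen2 = genes, stringsAsFactors = FALSE)
    ordered <- ordered[ordered$Gen1 != ordered$Gen2, ]

    # Unordered pairs: combn returns each pair exactly once (one pair per row after t())
    unordered <- t(combn(genes, 2))

    # Sorting the two genes within each row gives one label per unordered pair
    labels <- apply(ordered, 1, function(p) paste(sort(p), collapse = "-"))

    nrow(ordered)    # 12 ordered pairs
    nrow(unordered)  # 6 unordered pairs
    table(labels)    # each of the 6 labels occurs twice
    ```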
    

    Then I loop through it, keeping only the two genes of each comparison and grouping by treated:

    Comps <- list()
    for (i in seq_len(nrow(GeneComp))) {
      Comps[[i]] <- example.df %>%
        filter(gene == GeneComp[i,]$Gen1 | gene == GeneComp[i,]$Gen2) %>% # keep only the rows for the two genes in the ith comparison
        group_by(treated) %>% # then group by treated
        summarise_if(is.numeric, max) %>% # then take the max of each numeric column
        mutate(Comparison = paste(GeneComp[i,]$Gen1, GeneComp[i,]$Gen2, sep = "-")) # and generate the comparison variable
    }
    
    Comps <- bind_rows(Comps) # and finally join in a data frame
    

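    Let me know if it does everything you want.

    Addition: to get each comparison only once

    It is important here that your genes are strings and not factors, because the sort() call in the loop further down has to order the gene labels themselves (with factors, depending on your R version, c() may return the underlying integer codes instead). So you might have to do this: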

    options(stringsAsFactors = FALSE)
    
    example.df <- data.frame( 
      gene = sample(c("A", "B", "C", "D"), 100, replace = TRUE),
      treated = sample(c("Yes", "No"), 100, replace = TRUE),
      resp = rnorm(100, 10, 5), effect = rnorm(100, 25, 5))
    

    Then, in expand.grid, also add the stringsAsFactors = FALSE argument:

    GeneComp <- expand.grid(Gen1 = unique(example.df$gene), Gen2 = unique(example.df$gene), stringsAsFactors = FALSE) %>% filter(Gen1 != Gen2) %>% arrange(Gen1)
    

    Now, when pasting the Comparison variable in the loop, you can sort both inputs so that, for example, B-A and A-B get the same label. The rows will still be duplicated, but the distinct function at the end removes the duplicates and leaves your data the way you want it:

    Comps <- list()
    for (i in seq_len(nrow(GeneComp))) {
      Comps[[i]] <- example.df %>%
        filter(gene == GeneComp[i,]$Gen1 | gene == GeneComp[i,]$Gen2) %>% # keep only the rows for the two genes in the ith comparison
        group_by(treated) %>% # then group by treated
        summarise_if(is.numeric, max) %>% # then take the max of each numeric column
        mutate(Comparison = paste(sort(c(GeneComp[i,]$Gen1, GeneComp[i,]$Gen2)), collapse = "-")) # and generate the comparison variable with the two genes in sorted order
    }
    
    Comps <- bind_rows(Comps) %>% distinct() # finally join into one data frame and drop the duplicated rows
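
    As a possible alternative, here is a rough sketch that loops over the columns of combn directly, so each pair is visited once and no distinct step is needed. It is not the approach above, only a variant under the same assumptions about example.df. It uses base R only so it runs without extra packages; pair_max and comps are hypothetical names, and the column names follow the output layout sketched in the question.

    ```r
    set.seed(123)

    example.df <- data.frame(
      gene = sample(c("A", "B", "C", "D"), 100, replace = TRUE),
      treated = sample(c("Yes", "No"), 100, replace = TRUE),
      resp = rnorm(100, 10, 5),
      effect = rnorm(100, 25, 5),
      stringsAsFactors = FALSE
    )

    # Each column of `pairs` holds one unordered pair of genes
    pairs <- combn(sort(unique(example.df$gene)), 2)

    # Hypothetical helper: max of resp and effect per treated level,
    # using only the rows that belong to the two genes in `p`
    pair_max <- function(p) {
      sub <- example.df[example.df$gene %in% p, ]
      out <- aggregate(cbind(resp, effect) ~ treated, data = sub, FUN = max)
      names(out) <- c("group", "max.resp", "max.effect")
      cbind(comparison = paste(p, collapse = "-"), out)
    }

    # One small data frame per pair, stacked into a single result
    comps <- do.call(rbind, lapply(seq_len(ncol(pairs)), function(j) pair_max(pairs[, j])))

    # Order as in the requested layout: by group, then by comparison
    comps <- comps[order(comps$group, comps$comparison), ]
    rownames(comps) <- NULL
    comps
    ```

    Because the pair max is taken over the pooled rows of both genes, each value equals the larger of the two per-gene maxima for that treated level.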