Search code examples
rfor-loopdplyrapplysummary

How to repeat summarise a column based on multiple other column groupby s


Let's say I want to calculate the mean (or a custom function) on column A based on different values in columns B-D. Here is the data:

input:
data <- data.frame(A = round(runif(20,min = 0,max = 10),0),
                   B = round(runif(20,min = 0,max = 1),0),
                   C = round(runif(20,min = 0,max = 1),0),
                   D = round(runif(20,min = 0,max = 1),0))

output (note your rand numbers might result in different summary table):
col value mean    
B   0     5.92
B   1     4.71
C   0     6   
C   1     5.17
D   0     4.89
D   1     6

I can do it for each column separately:

data %>% group_by(B) %>% summarise(mean(A))

I put it in a for loop:

p <- data.frame(NULL)
for(i in c('B','C','D')){
  q <- data %>% group_by_(i) %>% summarise(col=i,mean = mean(A))
  p <- append(p,q)
}

but it didnt work quite as expected. Any suggestions would be very helpful.


Solution

  • An option would be to gather the data into 'long' format, grouped by 'key', 'val' columns, get the mean of 'A'

    library(tidyverse)
    gather(data, key, val, B:D) %>%
         group_by(key, val) %>%
         summarise(A = mean(A))
    

    Or in base R, by unlisting the columnss from 'B' to 'D' and using the grouping column as 'A' with the replicated column names

    aggregate(A ~ ., cbind(data['A'], cN = names(data)[-1][col(data[-1])], 
               group = unlist(data[-1])), mean)
    

    data

    set.seed(24)
    data <- data.frame(A = round(runif(20,min = 0,max = 10),0),
                   B = round(runif(20,min = 0,max = 1),0),
                   C = round(runif(20,min = 0,max = 1),0),
                   D = round(runif(20,min = 0,max = 1),0))