Search code examples
rdataframecrosstab

Summarize two dataframes in r


I have two dataframes

df1
#    var1 var2
# 1 X01    Red
# 2 X02    Green
# 3 X03    Red
# 4 X04    Yellow
# 5 X05    Red
# 6 X06    Green
df2
#   X01    X02    X03   ...
# 1 1      0.1    2.1
# 2 2      0.2    2.2
# 3 3      0.3    2.3
# 4 4      0.4    2.4
# 5 5      0.5    2.5
# 6 6      0.6    2.6

I am trying to get something like this

            
mean green  val1
mean red    val2
mean yellow val3

when mean is calculated with the corresponding variables.


Solution

  • We could change match the column names from 'df2' with the keyval pair from 'df1' to replace with the color value, replicate by the colum index, unlist the 'df2' and use a group by approach to get the mean

    tapply(unlist(df2), setNames(df1$var2, df1$var1)[names(df2)][col(df2)], 
        FUN = mean, na.rm = TRUE)
    

    Or using tidyverse, reshape to 'long' and do a join before doing the group_by mean

    library(dplyr)
    library(tidyr)
    df2 %>% 
     pivot_longer(cols = everything(), names_to = 'var1') %>% 
     left_join(df1) %>% 
     group_by(var2) %>% 
     summarise(value = mean(value, na.rm = TRUE), .groups = 'drop')
    

    data

    df1 <- structure(list(var1 = c("X01", "X02", "X03", "X04", "X05", "X06"
    ), var2 = c("Red", "Green", "Red", "Yellow", "Red", "Green")),
       class = "data.frame", row.names = c("1", 
    "2", "3", "4", "5", "6"))
    
    df2 <- structure(list(X01 = 1:6, X02 = c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6
    ), X03 = c(2.1, 2.2, 2.3, 2.4, 2.5, 2.6)), 
      class = "data.frame", row.names = c("1", 
    "2", "3", "4", "5", "6"))