Search code examples
rdataframegroup-bydplyrsummarize

group_by, using different functions for different columns based on their name


I have a large df with many columns (+100), some of which have a name that ends in "_e" and others in "_se". I want to summarize these variables using the label of the column sign. For those columns that ent in "_e", I would like to have the sum of the values. For those ended in "_se", the formula to use is the square root of the sum of the square power of the values - sqrt(sum((x)^2).

Is there I way I could use to aggregate my data using these different functions (sum and sqrt(sum((x)^2)) depending on the way the columns are named? An added condition is that the grouping only takes place for those columns whose name either ends on _e or _se, avoiding all those that are not named like this.

The desired outcome would then be a df with 2 rows - one for "cold" and one for "hot", with the new aggregated values for the columns that end in "_e" or "_se". see dummy data below:

structure(list(POC_e = c(58521L, 16161L, 1194L, 2699L, 1271L, 
1967L, 28248L, 12160L, 14397L, 7079L, 13667L, 4951L, 6604L, 13696L, 
12551L, 2772L, 1848L, 34065L, 33594L, 67847L, 25307L, 70405L, 
93996L, 65963L, 39789L, 20335L, 15903L, 108016L, 31371L, 24278L, 
50822L, 20552L, 37172L, 41811L, 28593L, 16834L, 26969L), POC_se = c(1291.92647266486, 
939.72404052784, 190.346374094663, 388.212893466502, 208.822332096987, 
252.213709480391, 1160.08565499059, 838.438264515793, 878.262309647022, 
535.666492080041, 705.069168328457, 416.437334300563, 598.234978711058, 
937.860748119978, 626.051576985209, 329.912572655009, 297.104027430243, 
1103.07866902155, 1068.06832578293, 1581.78604703798, 1060.61990776234, 
1419.56938660059, 1844.54038903284, 1766.10256323752, 1983.39556445415, 
882.694867765916, 805.685353170852, 2313.56161072355, 920.266765311779, 
967.61140849172, 1634.66447221862, 504.867156218601, 824.885933362813, 
975.720818549529, 1096.29928336111, 777.860187976763, 952.665438551264
), BelPov_e = c(20565L, 3812L, 66L, 840L, 164L, 145L, 13560L, 
3952L, 5483L, 2591L, 3093L, 1928L, 1149L, 4197L, 4125L, 336L, 
190L, 8558L, 7332L, 14872L, 5141L, 14086L, 28506L, 10753L, 4411L, 
5961L, 5799L, 22387L, 5189L, 3335L, 7407L, 867L, 3077L, 3329L, 
2817L, 1597L, 2406L), BelPov_se = c(1068.87660902013, 411.317970001003, 
23.6457754878744, 172.906389645903, 54.6876004289108, 55.5557916537185, 
892.989254922357, 401.344733821059, 604.616650266614, 336.438879762679, 
404.301023537906, 282.386361349728, 147.670323118653, 496.534865097384, 
483.679751234656, 103.488187597609, 80.8533491461191, 648.06537390533, 
661.960303317718, 953.091653167516, 484.435550214379, 1032.29133967345, 
1525.69708374773, 869.176346634638, 637.864956679027, 655.848453420799, 
525.336305385565, 1478.17209676848, 420.427043758192, 416.538929000497, 
658.333683680529, 121.413258770268, 278.074805599612, 361.857390496834, 
298.858858908146, 234.619743804892, 359.830153667996), sign = c("cold", 
"cold", "cold", "cold", "cold", "cold", "cold", "cold", "cold", 
"cold", "cold", "cold", "cold", "cold", "cold", "cold", "cold", 
"hot", "hot", "hot", "hot", "hot", "hot", "hot", "hot", "cold", 
"cold", "hot", "hot", "hot", "hot", "hot", "hot", "hot", "hot", 
"hot", "hot")), row.names = c(NA, 37L), class = "data.frame")

Solution

  • in Base-R

    e <- df1[,grep("_e", colnames(df1))]
    se <- df1[,grep("_se", colnames(df1))]
    
    e <- lapply(split(e, df1$sign), colSums)
    se <-  lapply(split(se, df1$sign), function(x) apply(x,2,function(y) sqrt(sum(y^2))))
    
    cbind(do.call(rbind,e),do.call(rbind,se))
    

    output:

          POC_e BelPov_e   POC_se BelPov_se
    cold 236024    77956 3143.043  2062.604
    hot  817384   146070 5732.776  3165.828