Search code examples
rdataframe

How can I group variables by several criteria in a data frame to count the number of values greater than each grouped mean?


I am trying to count data on a big data frame. First I estimated means by group, and I 've calculated means for each group of each variable. I you want to count the occurrences of values in each group that are greater than their respective group means.

I have a data frame simmilar to this:

stage   sex cort    granu   
a   m   0.53    50
a   m   0.65    93
a   f   3.18    49
a   m   1.51    45
a   m   0.28    57
a   f   0.50    36
a   f   1.20    21
a   f   2.36    58
b   m   0.27    74
b   m   0.16    36
b   f   0.42    27
b   m   0.88    65
b   f   2.19    85
b   m   0.24    74
b   f   0.89    52
b   m   0.34    98
b   f   2.59    52
b   f   0.74    63
c   m   6.36    54
c   f   2.07    24
c   m   1.46    12
c   m   2.07    36
c   f   1.46    75
c   m   2.19    96
c   f   6.36    21
c   f   2.19    85

I would like to count the number of values that are greater than the mean of the group, to the moment I did something like this




mean_c<- aggregate(x= data$cort,
                       by = list(data$sex,data$stage),      
                       FUN = mean)

cah<-dplyr::filter(data,stage=="a" & sex=="h")
cah<-dplyr::select(cah, c(cort))
countah<-corta>1.81 #mean for the group (a,h)

cam<-dplyr::filter(data,stage=="a" & sex=="m")
cam<-dplyr::select(cam, c(cort))
countam<-corta>0.74 #mean for the group (a,m)

sum(countah)

But, I need to do the same with several variables in all the stages (a,b and c) and with both sexes (f,m). I can repeat the last four lines with each mean, but I think there may be an easier and less redundant way to do this but I can't think of any.


Solution

  • The functions group_by() and summarise() are what you need:

    library(dplyr)
    
    df <- read.table(text = "stage   sex cort    granu   
    a   m   0.53    50
    a   m   0.65    93
    a   f   3.18    49
    a   m   1.51    45
    a   m   0.28    57
    a   f   0.50    36
    a   f   1.20    21
    a   f   2.36    58
    b   m   0.27    74
    b   m   0.16    36
    b   f   0.42    27
    b   m   0.88    65
    b   f   2.19    85
    b   m   0.24    74
    b   f   0.89    52
    b   m   0.34    98
    b   f   2.59    52
    b   f   0.74    63
    c   m   6.36    54
    c   f   2.07    24
    c   m   1.46    12
    c   m   2.07    36
    c   f   1.46    75
    c   m   2.19    96
    c   f   6.36    21
    c   f   2.19    85", header = TRUE, check.names = FALSE)
    
    df1 <- df %>%
      group_by(stage, sex) %>%
      summarise(sum_gt_mean = sum(cort > mean(cort)), .groups = "drop")
    
    df1
    # A tibble: 6 × 3
    stage    sex    sum_gt_mean
    <chr>   <chr>     <int>
    1 a       f         2
    2 a       m         1
    3 b       f         2
    4 b       m         1
    5 c       f         1
    6 c       m         1