Search code examples
rdplyrgroupingcut

Stastistical summary of data group by Cut


I have hundreds of data providing body weight of small animals at various ages obtained from multiple research papers. I want to group and analyse the weight by age for periods of 7d. I have used cut in R successfuly to group the data by age and have the default number of values (count) in each 7d bin. But even after much Googling I can't find a means to extend 'cut' to obtain basic statistical summaries for each age bin to include mean, SE, CL and median. Is that possible and can anyone help or point me in the right direction?

I also looked at dplyr and although that seems to have the capability to provide statistical summaries, I couldn't see a way of grouping the ages by 7d intervals. Would that be a better alternative?

Thank you for any assistance from a non-programmer.

Additional info

Thanks to both of you for your comments. I am sorry to have provided limited information, I hope this clarifies the problem. I have over 2000 rows of data in Excel. The rows are age in days and the columns are 'MaleFI', MaleMEI' and MaleBW where FI is food intake, MEI is metabolizable energy intake, and BW is body weight. A section of the data is shown below. Usually I can analyse the data with an Excel Pivot Table but that does not include an option to calculate median or standard error; Power Pivot does but doesn't group! So the data are as follows (sorted by age in days and no blanks):

  • Age Male FI MaleMEI Male BW
  • 28.00 14.62 212.66 121.68
  • 28.00 13.82 201.03 112.15
  • 28.00 13.82 201.03 112.15
  • 29.00 15.12 220.31 125.14 ...

The ages are continous and may go up to about 900, with a varying number of values for each age.

The objective is to group by 7d periods and as I mention have statistical analysis on each group thus:

  • Age MaleFI MaleMEI MaleBW
  • 21-28 Mean Mean Mean
  • SE SE SE
  • Median Median Median
  • 29-35 Mean Mean Mean
  • SE SE SE
  • Median Median Median
  • 36-42 etc.

This is the code that I used to group the data, but as I mentioned, with hindsight probably using dplyr with group_by and summarise might be a better approach.

library("xlsx")
library("dplyr")
Pivot.data <- read.xlsx(file.choose(), 1)  # read first sheet
pt<-cut(Pivot.data$Age, breaks=seq(21, 800, by=7))
table(pt)

The output is (21,28] (28,35] (35,42] (42,49] (49,56] (56,63] (63,70] (70,77]
6 15 41 73 92 98 95 99

I would be happy with the format of the ranges, i.e (21,28] etc rather than 21-28.

Thanks for your comments and I am sorry not to have made the requirement clearer. I do appreciate you giving up your time to help.

Dan, here is the output using dput:

dput(head(Pivot.data, 20)) structure(list(Age = c(28, 28, 28, 28, 28, 28, 30, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35), MaleFI = c(14.62, 13.82, 10.6691449814126, 15.9859154929577, 11.7, 14.0273778252258, 13.5877862595419, 17.73, 17.93, 17.99, 22.1214285714286, 17.6, 22.48, 21.7, 19.6, 21.4, 21.25, 20.37, 19.3215613382899, 23.169014084507), MaleMEI = c(212.66252, 201.02572, 144.342862453531, 216.273450704225, 160.171462269, 204.047711328562, 197.653240885495, 257.90058, 241.76812, 261.68254, 298.285342857143, 238.3216, 304.40168, 315.6482, 285.1016, 311.2844, 309.1025, 296.30202, 261.401403345724, 313.453591549295), MaleBW = c(121.68, 112.15, 85.7142857142856, 143.181818181818, 109.20245398773, 89.8187948576385, 126.522593320235, 131.96, 127.98, 142.57, 126.92, 146.9, 145.45, 131.9, 129.8, 132.4, 191.21, 179.44, 138.095238095238, 202.272727272727)), row.names = c(NA, 20L), class = "data.frame")


Solution

  • If I understand correctly (but please, try to include a reproducible example next time), this code should do:

    library(tidyverse)
    your_df %>% 
      # mutate(Age_cl = cut(Age, c(-Inf,30,60,Inf))) %>% 
      mutate(Age_cl = cut(Age, breaks=c(-Inf, seq(min(Age), max(Age), by=7), Inf), right = F)) %>%
      group_by(Age_cl) %>% 
      summarise(
        n=n(),
        m=mean(Age, na.rm=T),
        sd=sd(Age, na.rm=T),
        max=max(Age, na.rm=T),
        min=min(Age, na.rm=T)
      )
    

    You create the new column age_cl which is the result of cut, then you group by this new column and summarise (or you could mutate again, as you want) all the summaries you want.

    You can also use summarise_at with a list of functions:

    your_df %>% 
      mutate(Age_cl = cut(Age, c(-Inf,30,60,Inf))) %>% 
      group_by(Age_cl) %>% 
      summarise_at("Age", list(m=mean, sd=sd, max=max, min=min), na.rm=T)
    

    Note that the other arguments will apply to all functions, like na.rm here.

    EDIT: For the records, it seems that there is a way to use the summary function, but you have to transform its results into a data.frame for it to work. This may not be worth it for summary, compared to writing your functions like above. Here is an example with the iris dataset:

    iris %>% 
      mutate(Sepal.Length_cl = cut(Sepal.Length, c(-Inf,5,6,Inf))) %>% 
      group_by(Sepal.Length_cl) %>% 
      group_modify(~summary(.$Sepal.Length) %>% unclass %>% t %>% as.data.frame)