I have a large (> 200,000 rows) data frame with dozens of columns of data. I want to distill this data frame down and summarize the number of observations whose variables fall within given ranges.
For instance, if I have a data.frame similar to this:
set.seed(10)
df <- data.frame( age = runif( n = 1000, min = 0, max = 4000 ),
                  size = rnorm( n = 1000, mean = 10, sd = 1 ),
                  shape = rnorm( n = 1000, mean = 1000, sd = 1000 ) )
and I would like to get the number of samples within a series of age ranges, along with the mean and median size and shape of the samples in each of those age brackets.
Something like
summary.df <- data.frame( age.group = seq( 0, 3900, by = 100 ),
                          number = ( number of samples in age bin ),
                          mean = ( mean of data in age bin ) )
etc.
Right now I am doing this very bluntly by creating a new data.frame for each age group:
data.1 <- subset( df, age > 0 & age <= 100 )
data.2 <- subset( df, age > 100 & age <= 200 )
data.3 <- subset( df, age > 200 & age <= 300 )
etc., and then adding a categorical variable and a row count to each:
data.1 <- data.frame( data.1, age.group = "100", count.row = nrow( data.1 ) )
data.2 <- data.frame( data.2, age.group = "200", count.row = nrow( data.2 ) )
data.3 <- data.frame( data.3, age.group = "300", count.row = nrow( data.3 ) )
binding them back together:
data.big <- rbind( data.1, data.2, data.3 )
and then generating summary stats via dplyr
library(dplyr)

data.summary <- data.big %>%
  group_by( age.group ) %>%
  summarize( count.row = mean( count.row ),
             mean = mean( size, na.rm = TRUE ),
             median = median( size, na.rm = TRUE ) )
How would I go about doing this more efficiently with just dplyr? I think there must be a way, but I can't wrap my head around it.
Thanks for any help you can give!
You can make use of cut to divide the data into intervals of 100 and calculate summary statistics for each group.
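cut() returns a factor whose levels are the interval labels, which is what the grouping below keys on. A minimal illustration with a few made-up values:

cut( c(50, 150, 250), breaks = seq( 0, 400, by = 100 ) )
# [1] (0,100]   (100,200] (200,300]
# Levels: (0,100] (100,200] (200,300] (300,400]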
library(dplyr)

df %>%
  group_by( age = cut( age, seq( 0, 4000, by = 100 ) ) ) %>%
  summarise( mean = mean( size, na.rm = TRUE ),
             median = median( size, na.rm = TRUE ) )
# age mean median
# <fct> <dbl> <dbl>
# 1 (0,100] 10.0 9.92
# 2 (100,200] 9.88 10.2
# 3 (200,300] 10.1 10.1
# 4 (300,400] 9.83 9.80
# 5 (400,500] 9.95 9.72
# 6 (500,600] 9.68 9.78
# 7 (600,700] 10.2 10.5
# 8 (700,800] 10.2 10.4
# 9 (800,900] 9.68 9.47
#10 (900,1e+03] 9.80 9.81
# … with 30 more rows
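The question also asks for the number of samples per bin and the same statistics for shape; those are just additional columns in summarise(). A sketch along the same lines (the column names are only illustrative):

df %>%
  group_by( age.group = cut( age, seq( 0, 4000, by = 100 ) ) ) %>%
  summarise( number = n(),                               # samples in the age bin
             mean.size = mean( size, na.rm = TRUE ),
             median.size = median( size, na.rm = TRUE ),
             mean.shape = mean( shape, na.rm = TRUE ),
             median.shape = median( shape, na.rm = TRUE ) )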