I have a large (> 200,000 rows) data frame with dozens of columns of data. I want to distill this data frame down and summarize the number of observations whose variables fall within given ranges.
For instance, if I have a data.frame similar to this:
set.seed(10)
df <- data.frame( age = runif( n = 1000, min = 0, max = 4000 ),
                  size = rnorm( n = 1000, mean = 10, sd = 1 ),
                  shape = rnorm( n = 1000, mean = 1000, sd = 1000 ) )
and I would like to get the number of samples within a series of age ranges, along with the mean and median size and shape of the samples in each of those age brackets.
Something like
summary.df <- data.frame( age.group = seq( 0, 3900, by = 100 ),
                          number = ( number of samples in age bin ),
                          mean = ( mean of data in age bin ) )
etc.
Right now I am doing this very bluntly by creating a new data.frame for each age group:
data.1 <- subset( df, age > 0 & age <= 100 )
data.2 <- subset( df, age > 100 & age <= 200 )
data.3 <- subset( df, age > 200 & age <= 300 )
etc., and then adding a categorical variable and a row count to each:
data.1 <- data.frame( data.1, age.group = "100", count.row = nrow( data.1 ) )
data.2 <- data.frame( data.2, age.group = "200", count.row = nrow( data.2 ) )
data.3 <- data.frame( data.3, age.group = "300", count.row = nrow( data.3 ) )
binding them back together:
data.big <- rbind( data.1, data.2, data.3 )
and then generating summary stats via dplyr
library(dplyr)

data.summary <- data.big %>%
  group_by( age.group ) %>%
  summarize( count.row = mean( count.row ),
             mean = mean( size, na.rm = TRUE ),
             median = median( size, na.rm = TRUE ) )
How would I go about doing this more efficiently with just dplyr? I think there must be a way, but I can't wrap my head around it.
Thanks for any help you can give!
You can make use of cut to divide the data into intervals of 100 and calculate summary statistics for each group.
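cut() returns a factor whose levels are the interval labels, which is what the grouping below keys on. A minimal illustration with a few made-up values:

cut( c(50, 150, 250), breaks = seq( 0, 400, by = 100 ) )
# [1] (0,100]   (100,200] (200,300]
# Levels: (0,100] (100,200] (200,300] (300,400]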
library(dplyr)

df %>%
  group_by( age = cut( age, seq( 0, 4000, by = 100 ) ) ) %>%
  summarise( mean = mean( size, na.rm = TRUE ),
             median = median( size, na.rm = TRUE ) )
# age mean median
# <fct> <dbl> <dbl>
# 1 (0,100] 10.0 9.92
# 2 (100,200] 9.88 10.2
# 3 (200,300] 10.1 10.1
# 4 (300,400] 9.83 9.80
# 5 (400,500] 9.95 9.72
# 6 (500,600] 9.68 9.78
# 7 (600,700] 10.2 10.5
# 8 (700,800] 10.2 10.4
# 9 (800,900] 9.68 9.47
#10 (900,1e+03] 9.80 9.81
# … with 30 more rows
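The question also asks for the number of samples per bin and the same statistics for shape; those are just additional columns in summarise(). A sketch along the same lines (the column names are only illustrative):

df %>%
  group_by( age.group = cut( age, seq( 0, 4000, by = 100 ) ) ) %>%
  summarise( number = n(),                               # samples in the age bin
             mean.size = mean( size, na.rm = TRUE ),
             median.size = median( size, na.rm = TRUE ),
             mean.shape = mean( shape, na.rm = TRUE ),
             median.shape = median( shape, na.rm = TRUE ) )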