Search code examples
raggregate

R calculate how many values used to calculate mean in aggregate function


I have a dataframe of daily observations dating from 1963-2022. I want to calculate the mean of the observation for each month. However, some months don't have data for each day and some only have one datapoint for one month. This skews some of the data points. How do I calculate how many observations have been used to calculate the mean for a given month.

Head of Data frame

   structure(list(prcp_amt = c(0, 1.8, 6.4, 5.1, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 4.3, 0, 0, 0, 0, 4.6, 0, 0, 0, 0, 0, 0, 0, 0.3, 
4.8, 0, 0, 4.1, 0, 0, 0, 0.3, 3.6, 6.6, 0, 0, 0, 0, 0, 0, 0.8, 
0, 0, 0, 0, 0), ob_date = structure(c(-220838400, -220752000, 
-220665600, -220579200, -220492800, -220406400, -220320000, -220233600, 
-220147200, -220060800, -219974400, -219888000, -219801600, -219715200, 
-219628800, -219542400, -219456000, -219369600, -219283200, -219196800, 
-219110400, -219024000, -218937600, -218851200, -218764800, -218678400, 
-218592000, -218505600, -218419200, -218332800, -218246400, -218160000, 
-218073600, -217987200, -217900800, -217814400, -217728000, -217641600, 
-217555200, -217468800, -217382400, -217296000, -217209600, -217123200, 
-217036800, -216950400, -216864000, -216777600, -216691200, -216604800
), class = c("POSIXct", "POSIXt"), tzone = "GMT")), row.names = c(NA, 
50L), class = "data.frame")

Existing code

# historic monthly rainfall
rainHist$month <- as.numeric(format(rainHist$ob_date, '%m'))
rainHist$year <- as.numeric(format(rainHist$ob_date, '%Y'))
rainHistMean <- aggregate(prcp_amt ~ month + year, rainHist, FUN=mean)
rainHistMean$day <- 01

rainHistMean <-
  rainHistMean %>%
    mutate(rainHistMean, Date=paste(year, month, day, sep='-'))

rainHistMean[['Date']] <- as.POSIXct(rainHistMean[['Date']],
                                     format='%Y-%m-%d',
                                     tz='GMT'
                                     )

Updated Code

rainHist$month <- as.numeric(format(rainHist$ob_date, '%m'))
rainHist$year <- as.numeric(format(rainHist$ob_date, '%Y'))
rainHistMean <- aggregate(prcp_amt ~ month + year, rainHist, FUN=function(x) c(mean(x), length(x)))
names(rainHistMean) <- c('month', 'year', 'prcp_amt', 'n')

How do I get there to be 4 columns not 3 with a matrix?

enter image description here

Solution

rainHist$month <- as.numeric(format(rainHist$ob_date, '%m'))
rainHist$year <- as.numeric(format(rainHist$ob_date, '%Y'))
rainHistMean <- aggregate(prcp_amt ~ month + year, rainHist, FUN=function(x) c(mean(x), length(x)))
rainHistMean <- data.frame(rainHistMean[1:2], rainHistMean[[3]])
names(rainHistMean) <- c('month', 'year', 'prcp_amt', 'n')

Solution

  • There may be more elegant solutions, but you can use dplyr to group by month and year, then get the count and mean in summarize:

    df %>% 
      group_by(month(ob_date), year(ob_date)) %>% 
      summarize(mean_prcp = mean(prcp_amt),
                count = n())
    

    Output:

    # # Groups:   month(ob_date) [2]
    # `month(ob_date)` `year(ob_date)` mean_prcp count
    # <dbl>           <dbl>     <dbl> <int>
    # 1                1            1963      0.91    30
    # 2                2            1963      0.77    20