Search code examples
rdplyrsummarize

Divide monthly summary of column by value in other column using dplyr


I have a dataframe called df that looks like this:

id    date      year  month day value      QC start_yr end_yr series_length                                         
                                    
275 28/09/2010  2010    9   28  1.36159873  1   2010    2020    11
275 29/09/2010  2010    9   29  1.43747532  1   2010    2020    11
275 30/09/2010  2010    9   30  1.15989387  1   2010    2020    11
275 01/10/2010  2010    10  1   1.34619892  1   2010    2020    11
275 02/10/2010  2010    10  2   1.55105293  1   2010    2020    11
275 03/10/2010  2010    10  3   1.37231886  1   2010    2020    11
275 04/10/2010  2010    10  4   1.02093196  1   2010    2020    11
275 05/10/2010  2010    10  5   1.70137763  1   2010    2020    11
275 06/10/2010  2010    10  6   1.64840853  1   2010    2020    11
275 07/10/2010  2010    10  7   1.92625344  1   2010    2020    11
275 08/10/2010  2010    10  8   1.07618248  1   2010    2020    11
275 09/10/2010  2010    10  9   1.04859626  1   2010    2020    11
275 10/10/2010  2010    10  10  0.77795529  1   2010    2020    11
275 11/10/2010  2010    10  11  1.42359686  1   2010    2020    11
275 12/10/2010  2010    10  12  0.91871309  1   2010    2020    11
275 13/10/2010  2010    10  13  0.89544863  1   2010    2020    11
275 14/10/2010  2010    10  14  0.56718421  1   2010    2020    11
275 15/10/2010  2010    10  15  0.70719278  1   2010    2020    11

(with thousands of other rows)

I want to create a summary dataset which groups by id and sums the value on a monthly basis, and divides by the series_length, to give an overall monthly average

I can do this for each individual station by using:

mean_monthly <- df %>%
  group_by(id, month) %>%
  summarise(monthly_value = sum(value)/11)

and this works fine for the first id which has a series length of 11, but other ids have a series_length of between 1 and 30. So dividing these by 11 gives incorrect information. I want to summarise so that I create a new dataset which sums all the value values on a monthly basis, and divides by the series_length automatically, without me having to specify the number like I have done with '11' above.

I hope this makes sense and would greatly appreciate some help.


Solution

  • As long as series_length is repeated you could also use min, max, mean, or last instead of first. They would all return the same value.

    df %>% 
      group_by(id, month) %>% 
      summarize(monthly_value = sum(value) / first(series_length))
    

    If series_length is just a count of the number of observations in your group_by then you can also use n():

    df %>% 
      group_by(id, month) %>% 
      summarize(monthly_value = sum(value) / n())
    

    I'm not sure if this is the case because in your repex when month is 10 there are more than 11 observations (the value in series_length).