I have a dataframe called df
that looks like this:
id date year month day value QC start_yr end_yr series_length
275 28/09/2010 2010 9 28 1.36159873 1 2010 2020 11
275 29/09/2010 2010 9 29 1.43747532 1 2010 2020 11
275 30/09/2010 2010 9 30 1.15989387 1 2010 2020 11
275 01/10/2010 2010 10 1 1.34619892 1 2010 2020 11
275 02/10/2010 2010 10 2 1.55105293 1 2010 2020 11
275 03/10/2010 2010 10 3 1.37231886 1 2010 2020 11
275 04/10/2010 2010 10 4 1.02093196 1 2010 2020 11
275 05/10/2010 2010 10 5 1.70137763 1 2010 2020 11
275 06/10/2010 2010 10 6 1.64840853 1 2010 2020 11
275 07/10/2010 2010 10 7 1.92625344 1 2010 2020 11
275 08/10/2010 2010 10 8 1.07618248 1 2010 2020 11
275 09/10/2010 2010 10 9 1.04859626 1 2010 2020 11
275 10/10/2010 2010 10 10 0.77795529 1 2010 2020 11
275 11/10/2010 2010 10 11 1.42359686 1 2010 2020 11
275 12/10/2010 2010 10 12 0.91871309 1 2010 2020 11
275 13/10/2010 2010 10 13 0.89544863 1 2010 2020 11
275 14/10/2010 2010 10 14 0.56718421 1 2010 2020 11
275 15/10/2010 2010 10 15 0.70719278 1 2010 2020 11
(with thousands of other rows)
I want to create a summary dataset which groups by id and sums the value on a monthly basis, and divides by the series_length, to give an overall monthly average
I can do this for each individual station by using:
mean_monthly <- df %>%
group_by(id, month) %>%
summarise(monthly_value = sum(value)/11)
and this works fine for the first id which has a series length of 11, but other ids have a series_length of between 1 and 30. So dividing these by 11 gives incorrect information. I want to summarise so that I create a new dataset which sums all the value
values on a monthly basis, and divides by the series_length automatically, without me having to specify the number like I have done with '11' above.
I hope this makes sense and would greatly appreciate some help.
As long as series_length
is repeated you could also use min
, max
, mean
, or last
instead of first
. They would all return the same value.
df %>%
group_by(id, month) %>%
summarize(monthly_value = sum(value) / first(series_length))
If series_length
is just a count of the number of observations in your group_by
then you can also use n()
:
df %>%
group_by(id, month) %>%
summarize(monthly_value = sum(value) / n())
I'm not sure if this is the case because in your repex when month
is 10 there are more than 11 observations (the value in series_length
).