Search code examples
rdplyr

Create a new column with averages for time intervals


I have a dataset that begins in year 1988 and ends in year 2020. I want to create averages for certain time intervals. For example, 5 years: 1988-1992, 1993-1997 and so on. But I want a new column with these averages.

For example, suppose I have this:

anos <- 1988:2020
valores <- c(15, 18, 20, NA, 25, 27, 28, NA, 32, 35, 36, 38, 40, 
             42, 45, 46, NA, 50, 52, 55, 56, 58, 60, NA, 65, 66, 
             68, 70, 72, 75, 76, 78, 80)

dataset <- data.frame(Ano = anos, Valor = valores)

I want to have this:

anos <- 1988:2020

valores <- c(15, 18, 20, NA, 25, 27, 28, NA, 32, 35, 36, 38, 40, 
             42, 45, 46, NA, 50, 52, 55, 56, 58, 60, NA, 65, 66, 
             68, 70, 72, 75, 76, 78, 80)

medias <- c(19.5, 19.5, 19.5, 19.5, 19.5,
            30.5, 30.5, 30.5, 30.5, 30.5,
            40.2, 40.2, 40.2, 40.2, 40.2,
            50.75, 50.75, 50.75, 50.75, 50.75,
            59.75, 59.75, 59.75, 59.75, 59.75,
            70.2, 70.2, 70.2, 70.2, 70.2,
            78, 78, 78)
            
dataset <- data.frame(Ano = anos, Valor = valores, Medias = medias)

There are five year averages for each time interval and the mean value repeats itself until the next five years. The last mean is an average of 3 values since the overall period is not a multiple of five.

EDIT: also, I will need to use group_by().


Solution

  • Here's a dplyr version :

    library(dplyr)
    n <- 5
    
    dataset %>%
      arrange(Ano) %>%
      group_by(Group = floor((Ano - first(Ano))/n)) %>%
      mutate(Medias = mean(Valor, na.rm = TRUE)) %>%
      data.frame()
    

    which returns the following -

    #    Ano Valor Group Medias
    #1  1988    15     0  19.50
    #2  1989    18     0  19.50
    #3  1990    20     0  19.50
    #4  1991    NA     0  19.50
    #5  1992    25     0  19.50
    #6  1993    27     1  30.50
    #7  1994    28     1  30.50
    #8  1995    NA     1  30.50
    #9  1996    32     1  30.50
    #10 1997    35     1  30.50
    #11 1998    36     2  40.20
    #12 1999    38     2  40.20
    #13 2000    40     2  40.20
    #14 2001    42     2  40.20
    #15 2002    45     2  40.20
    #16 2003    46     3  50.75
    #17 2004    NA     3  50.75
    #18 2005    50     3  50.75
    #19 2006    52     3  50.75
    #20 2007    55     3  50.75
    #21 2008    56     4  59.75
    #22 2009    58     4  59.75
    #23 2010    60     4  59.75
    #24 2011    NA     4  59.75
    #25 2012    65     4  59.75
    #26 2013    66     5  70.20
    #27 2014    68     5  70.20
    #28 2015    70     5  70.20
    #29 2016    72     5  70.20
    #30 2017    75     5  70.20
    #31 2018    76     6  78.00
    #32 2019    78     6  78.00
    #33 2020    80     6  78.00
    

    In the example that you have shared you have a single entry for each year and data for all the year is present however, this will also work when :

    1. You have more than 1 rows for a year.
    2. You don't have data for all the years.