Search code examples
rcasemutate

Using mutate to get an average of the previous weeks temperature


I'm looking for a way to create a new variable for an average of the temperature the week before.

I've got a data frame that looks like the following

weather_data

Rows: 3,664
Columns: 2
$ dt                  <date> 2014-01-01, 2014-01-02, 2014-01-03, 2014-01-04, 2014-01…
$ temp                <dbl> 6.390000, 6.234167, 6.307500, 4.436250, 4.432917, 8.4508…

What I'm hoping to do is get a new column called prev.temp that gives the average temperature for the previous week.

I know how to use mutate to create the new column but am struggling with the referencing. I can't group_by because the "week prior" is a rolling thing but I'm struggling with trying to reference the date for the observation and also the 7 dates of information I need.

My first thought was something like this: mutate(prev.temp = casewhen(date > date - 7, mean(temp))) but I'm sure its not right.


Solution

  • Two options:

    1. A "rolling window" works well when you are "guaranteed" to have no gaps in your dates. One assumption here is that 1-row is always 1 day, so we rely on that being the case. Another assumption is that the data is pre-ordered by dt.

      We'll use zoo::rollapplyr here, width of 7. partial= means that during the first week of data, we'll still average those observations before today.

      zoo::rollapplyr(weather_data$temp, 7, FUN = mean, na.rm = TRUE, partial = TRUE)
      #  [1] 9.233254 9.333467 7.414063 7.679054 7.498385 7.193965 7.256155 6.253119 5.893012 6.431630 5.952437 6.051908 6.586221 5.967587 6.388821 6.752706 7.103914
      # [18] 6.666444 6.352582 5.871288 6.705206 6.289171 6.352013 6.311438 6.266374 6.316793 6.098056 6.100250 6.496584 6.300014 6.031206 6.968000 6.805867 7.185110
      # [35] 6.025666 6.521882 5.456449 4.775102 4.897948 5.185524 4.792596 5.347797 4.325031 5.567295 5.855414 5.920953 6.275780 6.610605 7.298713 8.046237 7.223235
      

      Verifying the data, the first week of data should be in the 7th position of this vector (it is):

      mean(weather_data$temp[1:7])
      # [1] 7.256155
      

      We need to lag this by one back into the data, so we can do

      weather_data %>%
        mutate(prev.temp = lag(zoo::rollapplyr(temp, 7, FUN = mean, na.rm = TRUE, partial = TRUE)))
      # # A tibble: 51 × 3
      #    dt          temp prev.temp
      #    <date>     <dbl>     <dbl>
      #  1 2023-11-29  9.23     NA   
      #  2 2023-11-30  9.43      9.23
      #  3 2023-12-01  3.58      9.33
      #  4 2023-12-02  8.47      7.41
      #  5 2023-12-03  6.78      7.68
      #  6 2023-12-04  5.67      7.50
      #  7 2023-12-05  7.63      7.19
      #  8 2023-12-06  2.21      7.26
      #  9 2023-12-07  6.91      6.25
      # 10 2023-12-08  7.35      5.89
      # # ℹ 41 more rows
      # # ℹ Use `print(n = ...)` to see more rows
      

      If you think you may have "gappy data", then all you need to do is interpolate the missing dates, give them a temp of NA, arrange by date, and then we are back to business with "guaranteed no gaps".

      weather_data_gapped <- weather_data[-c(4, 8),]
      head(weather_data_gapped, 10)
      # # A tibble: 10 × 2
      #    dt          temp
      #    <date>     <dbl>
      #  1 2023-11-29  9.23
      #  2 2023-11-30  9.43
      #  3 2023-12-01  3.58
      #  4 2023-12-03  6.78
      #  5 2023-12-04  5.67
      #  6 2023-12-05  7.63
      #  7 2023-12-07  6.91
      #  8 2023-12-08  7.35
      #  9 2023-12-09  5.12
      # 10 2023-12-10  7.47
      weather_data_gapped %>%
        reframe(dt = seq(min(dt), max(dt), by="day")) %>%
        left_join(weather_data_gapped, by = "dt")
      # # A tibble: 51 × 2
      #    dt          temp
      #    <date>     <dbl>
      #  1 2023-11-29  9.23
      #  2 2023-11-30  9.43
      #  3 2023-12-01  3.58
      #  4 2023-12-02 NA   
      #  5 2023-12-03  6.78
      #  6 2023-12-04  5.67
      #  7 2023-12-05  7.63
      #  8 2023-12-06 NA   
      #  9 2023-12-07  6.91
      # 10 2023-12-08  7.35
      # # ℹ 41 more rows
      # # ℹ Use `print(n = ...)` to see more rows
      

      (... and then do the rollapply thing as above.)

    2. A non-equi join also works, perhaps more akin to your case_when thought.

      weather_data %>%
        mutate(dt0 = dt, dt_from = dt - 8, dt_to = dt - 1, temp) %>%
        left_join(weather_data, join_by(between(y$dt, x$dt_from, x$dt_to)), suffix = c("", ".prev")) %>%
        summarize(prev.temp = mean(temp.prev, na.rm = TRUE), .by = c(dt0, temp)) %>%
        rename(dt = dt0)
      # # A tibble: 51 × 3
      #    dt          temp prev.temp
      #    <date>     <dbl>     <dbl>
      #  1 2023-11-29  9.23    NaN   
      #  2 2023-11-30  9.43      9.23
      #  3 2023-12-01  3.58      9.33
      #  4 2023-12-02  8.47      7.41
      #  5 2023-12-03  6.78      7.68
      #  6 2023-12-04  5.67      7.50
      #  7 2023-12-05  7.63      7.19
      #  8 2023-12-06  2.21      7.26
      #  9 2023-12-07  6.91      6.63
      # 10 2023-12-08  7.35      6.34
      # # ℹ 41 more rows
      # # ℹ Use `print(n = ...)` to see more rows
      

      This method is resilient to gapped data as well as out-of-order data.


    Sample data

    set.seed(42)
    weather_data <- tibble(dt = Sys.Date() - 50:0, temp = runif(51, 1, 10))
    head(weather_data, 10)
    # # A tibble: 10 × 2
    #    dt          temp
    #    <date>     <dbl>
    #  1 2023-11-29  9.23
    #  2 2023-11-30  9.43
    #  3 2023-12-01  3.58
    #  4 2023-12-02  8.47
    #  5 2023-12-03  6.78
    #  6 2023-12-04  5.67
    #  7 2023-12-05  7.63
    #  8 2023-12-06  2.21
    #  9 2023-12-07  6.91
    # 10 2023-12-08  7.35