Search code examples
rapplypurrrrolling-computationrunner

Vectorize loops when calculating rolling means with variable amounts of data


I have a dataframe of daily water chemistry values taken from deployed sensors. I’m trying to calculate rolling 7 day averages of daily maximum values. This in in-situ environmental data, the data can be a bit messy.

Here are the rules for calculating the averages and assigning quality levels:

  • Data is graded and given a quality value (DQL) for the day (dyDQL).
  • 'A' is high quality, 'B' is medium, and 'E' is poor.
  • 7 day average is calculated at the end of a 7 day period.
  • Dataset needs only 6 complete days to calculate a 7 day average (Can miss 1 day of data)
  • If there are at least 6 days worth ‘A’ and ‘B’ graded data and 1 day of ‘E’, discard ‘E’ data and calculate the 7-day average using the 6 days of ‘A’ and ‘B’ data

I have the code working using a loop that loops through each result, creates a new dataframe containing the 7 day window, and then calculates the moving average. See minimal example below.

Note that there are missing dates for the 11th, 16th, 17th, and 18th in this example:

  daily_data <- tibble::tribble(
    ~Monitoring.Location.ID,        ~date, ~dyMax, ~dyMin, ~dyDQL,
    "River 1", as.Date("2018-07-01"), 24.219, 22.537,    "A",
    "River 1", as.Date("2018-07-02"), 24.557, 20.388,    "A",
    "River 1", as.Date("2018-07-03"), 24.847, 20.126,    "A",
    "River 1", as.Date("2018-07-04"), 25.283, 20.674,    "A",
    "River 1", as.Date("2018-07-05"), 25.501, 20.865,    "A",
    "River 1", as.Date("2018-07-06"),  25.04, 21.008,    "A",
    "River 1", as.Date("2018-07-07"), 24.847, 20.674,    "A",
    "River 1", as.Date("2018-07-08"), 23.424, 20.793,    "B",
    "River 1", as.Date("2018-07-09"), 22.657, 18.866,    "E",
    "River 1", as.Date("2018-07-10"), 22.298,   18.2,    "A",
    "River 1", as.Date("2018-07-12"),  22.92, 19.008,    "A",
    "River 1", as.Date("2018-07-13"), 23.978, 19.532,    "A",
    "River 1", as.Date("2018-07-14"), 24.508, 19.936,    "A",
    "River 1", as.Date("2018-07-15"), 25.137, 20.627,    "A",
    "River 1", as.Date("2018-07-19"), 24.919, 20.674,    "A"
  )
  
for (l in seq_len(nrow(daily_data))){
  
  
  station_7day <- filter(daily_data,
                         dplyr::between(date, daily_data[[l,'date']] - lubridate::days(6), daily_data[l,'date']))
  
  
  daily_data[l,"ma.max7"] <- dplyr::case_when(nrow(subset(station_7day, dyDQL %in% c("A")))== 7 & l >=7  ~ mean(station_7day$dyMax),
                                              nrow(subset(station_7day, dyDQL %in% c("A", 'B'))) >= 6 & l >=7~ mean(station_7day$dyMax),
                                              max(station_7day$dyDQL == 'E') & nrow(subset(station_7day, dyDQL %in% c("A", "B"))) >= 6  & l >=7 ~ mean(station_7day$dyMax[station_7day$dyDQL %in% c("A", "B")]),
                                              nrow(subset(station_7day, dyDQL %in% c("A", "B", "E"))) >= 6 & l >=7~  mean(station_7day$dyMax),
                                              TRUE ~ NA_real_)
  
  daily_data[l, "ma.max7_DQL"] <-  dplyr::case_when(nrow(subset(station_7day, dyDQL %in% c("A")))== 7 & l >=7  ~ "A",
                                                    nrow(subset(station_7day, dyDQL %in% c("A", 'B'))) >= 6 & l >=7~ "B",
                                                    max(station_7day$dyDQL == 'E') & nrow(subset(station_7day, dyDQL %in% c("A", "B"))) >= 6  & l >=7 ~ "B",
                                                    nrow(subset(station_7day, dyDQL %in% c("A", "B", "E"))) >= 6 & l >=7~ "E",
                                                    TRUE ~ NA_character_)
  
  
}

The expected results are:

tibble::tribble(
  ~Monitoring.Location.ID,        ~date, ~dyMax, ~dyMin, ~dyDQL,         ~ma.max7, ~ma.max7_DQL,
                "River 1", as.Date("2018-07-01"), 24.219, 22.537,    "A",               NA,           NA,
                "River 1", as.Date("2018-07-02"), 24.557, 20.388,    "A",               NA,           NA,
                "River 1", as.Date("2018-07-03"), 24.847, 20.126,    "A",               NA,           NA,
                "River 1", as.Date("2018-07-04"), 25.283, 20.674,    "A",               NA,           NA,
                "River 1", as.Date("2018-07-05"), 25.501, 20.865,    "A",               NA,           NA,
                "River 1", as.Date("2018-07-06"),  25.04, 21.008,    "A",               NA,           NA,
                "River 1", as.Date("2018-07-07"), 24.847, 20.674,    "A", 24.8991428571429,          "A",
                "River 1", as.Date("2018-07-08"), 23.424, 20.793,    "B", 24.7855714285714,          "B",
                "River 1", as.Date("2018-07-09"), 22.657, 18.866,    "E", 24.5141428571429,          "B",
                "River 1", as.Date("2018-07-10"), 22.298,   18.2,    "A",            24.15,          "B",
                "River 1", as.Date("2018-07-12"),  22.92, 19.008,    "A",           23.531,          "E",
                "River 1", as.Date("2018-07-13"), 23.978, 19.532,    "A",           23.354,          "E",
                "River 1", as.Date("2018-07-14"), 24.508, 19.936,    "A",          23.2975,          "E",
                "River 1", as.Date("2018-07-15"), 25.137, 20.627,    "A",           23.583,          "E",
                "River 1", as.Date("2018-07-19"), 24.919, 20.674,    "A",               NA,           NA
  )

The code works fine, but is very slow when calculating values for multi-year levels of data with multiple different water quality parameters at multiple locations.

Due to the fact that a 7 day value can be calculated from 6 days of data, I don’t think I can use any of the rolling functions from the zoo package. I don’t think I can use the roll_mean function from the roll package, due to the variable nature of discarding 1 days worth of ‘E’ data when there is 6 days of ‘A’ or ‘B’ data.

Is there way to vectorize this, in order to avoid looping through every row of data?


Solution

  • I used tidyverse and runner and have done it like this in a single piped syntax. Syntax explanation-

    • I first collected seven days (as per logic provided) DQL and MAX values into a list using runner.
    • Before doing that, I have converted DQL into an ordered factored variable, which will be used in last syntax.
    • Secondly, i used purrr::map to modify each list according to given conditions,
      • Not less than six are to be counted
      • If there is exactly one E in 7 values, that has not to be counted.
    • Finally I unnested the list using unnest_wider
    library(runner)
    daily_data %>% mutate(dyDQL = factor(dyDQL, levels = c("A", "B", "E"), ordered = T),
                          d = runner(x = data.frame(a = dyMax, b= dyDQL),
                                       k = "7 days",
                                       lag = 0,
                                       idx = date,
                                       f = function(x) list(x))) %>%
      mutate(d = map(d, ~ .x %>% group_by(b) %>%
                         mutate(c = n()) %>%
                         ungroup() %>%
                         filter(!n() < 6) %>%
                         filter(!(b == 'E' & c == 1 & n() == 7)) %>%
                         summarise(ma.max7 = ifelse(n() == 0, NA, mean(a)), ma.max7.DQL = max(b))
                       )
             ) %>%
      unnest_wider(d)
    
    # A tibble: 15 x 7
       Monitoring.Location.ID date       dyMax dyMin dyDQL ma.max7 ma.max7.DQL
       <chr>                  <date>     <dbl> <dbl> <ord>   <dbl> <ord>      
     1 River 1                2018-07-01  24.2  22.5 A        NA   NA         
     2 River 1                2018-07-02  24.6  20.4 A        NA   NA         
     3 River 1                2018-07-03  24.8  20.1 A        NA   NA         
     4 River 1                2018-07-04  25.3  20.7 A        NA   NA         
     5 River 1                2018-07-05  25.5  20.9 A        NA   NA         
     6 River 1                2018-07-06  25.0  21.0 A        24.9 A          
     7 River 1                2018-07-07  24.8  20.7 A        24.9 A          
     8 River 1                2018-07-08  23.4  20.8 B        24.8 B          
     9 River 1                2018-07-09  22.7  18.9 E        24.8 B          
    10 River 1                2018-07-10  22.3  18.2 A        24.4 B          
    11 River 1                2018-07-12  22.9  19.0 A        23.5 E          
    12 River 1                2018-07-13  24.0  19.5 A        23.4 E          
    13 River 1                2018-07-14  24.5  19.9 A        23.3 E          
    14 River 1                2018-07-15  25.1  20.6 A        23.6 E          
    15 River 1                2018-07-19  24.9  20.7 A        NA   NA