Search code examples
rloopsaveragerolling-average

How to calculate a 3-day centered rolling average for measurements with varying numbers of measurements per date in R?


I have a dataset in R that contains methane measurements taken on different dates by of different cows. The dataset is structured as follows:

  • ID: An identifier for each cow (integer).
  • measure_date: The date when the measurements were taken (Date).
  • ch4: The methane measurement values (numeric).

Each date can have multiple measurements, and some dates might not have any measurements at all, so they do not appear in the dataset. I need to calculate a 3-day centered rolling average of the methane measurements (ch4) for each animal (id), considering all available measurements within the date range.

Key Requirements: Multiple measurements per date: The dataset includes multiple measurements for some dates, but can have a completely different number for another date depending on how many times the animal went to the feed bin, and some days might have no measurements. If the date has no measurement it doesn’t actually appear as a row in the data, but I can easily add in those rows if necessary.

3-day centered rolling average: The rolling average should be calculated over a 3-day window centered on each date. For example, the rolling average for January 2nd should consider all measurements from January 1st, 2nd, and 3rd.

Group by animal: The rolling average calculation should be done separately for each animal (ID), ensuring that the calculations for one animal does not affect another.

Example Dataset
Here's a small sample of what the dataset might look like for one techid:

Meausre_date = c(2023-01-01, 2023-01-01, 2023-01-01, 2023-01-01, 2023-01-02, 2023-01-03 2023-01-03, 2023-01-05)
Ch4 = c(200, 250, 233, 256, 270, 256, 290, 299)

Desired output
For each techid, I want to generate a new column that includes:

  • The original dates (including those with no measurements).
  • The average 3 day centered rolling average methane measurement for each date.

To summarize, I need an efficient way to compute a 3-day centered rolling average of methane measurements for each animal, considering varying numbers of measurements per date and including dates with no measurements in the sequence.

All of the methods I have seen and attempted so far haven’t been able to deal with the different number of measures on each measure_date. (it is a very large dataset) example of dataset

I presume I am missing something simple or just attacking this from the wrong angle. (changed some variable names in the code and dataset which is why there are differences)


Solution

  • Using the data shown reproducibly in the Note at the end, define a function, Mean, which makes centered means and then for each tech_id complete the dates using complete to insert missing dates and then run Mean on that.

    library (dplyr)
    library(tidyr)
    
    # x is times and y the values; returns rolling mean of 3 centered
    Mean <- function(x, y) {
      x <- as.numeric(x)
      sapply(x, function(z) mean(y[x %in% seq(z-1, z+1)], na.rm = TRUE))
    }
    
    DF %>%
      mutate(Measure_date = as.Date(Measure_date)) %>%
      group_by(tech_id) %>%
      complete(Measure_date = seq(min(Measure_date), max(Measure_date), by="day")) %>%
      mutate(Mean = Mean(Measure_date, Ch4) ) %>% 
      ungroup
    

    giving

    # A tibble: 9 × 4
      tech_id Measure_date   Ch4  Mean
        <dbl> <date>       <dbl> <dbl>
    1       1 2023-01-01     200  242.
    2       1 2023-01-01     250  242.
    3       1 2023-01-01     233  242.
    4       1 2023-01-01     256  242.
    5       1 2023-01-02     270  251.
    6       1 2023-01-03     256  272 
    7       1 2023-01-03     290  272 
    8       1 2023-01-04      NA  282.
    9       1 2023-01-05     299  299 
    

    Note

    DF <- data.frame(tech_id = 1,
      Measure_date = c("2023-01-01", "2023-01-01", "2023-01-01", "2023-01-01", 
      "2023-01-02", "2023-01-03", "2023-01-03", "2023-01-05"), 
      Ch4 = c(200, 250, 233, 256, 270, 256, 290, 299))