
How to extract multiple 5-minute averages from a data frame based on specified start time?


I have second-by-second data for channels A, B, and C as shown below (this just shows the first 6 rows):

                 date        A        B        C
1 2020-03-06 09:55:42 224.3763 222.3763 226.3763
2 2020-03-06 09:55:43 224.2221 222.2221 226.2221
3 2020-03-06 09:55:44 224.2239 222.2239 226.2239
4 2020-03-06 09:55:45 224.2044 222.2044 226.2044
5 2020-03-06 09:55:46 224.2397 222.2397 226.2397
6 2020-03-06 09:55:47 224.3690 222.3690 226.3690

I would like to be able to extract multiple 5-minute averages for columns A, B and C based off time. Is there a way to do this where I would only need to type in the starting time period, rather than having to type the start AND end times for each time period I want to extract? Essentially, I want to be able to type the start time and have my code calculate and extract the average for the successive 5 minutes.

I was previously using the 'timeAverage' function from the 'openair' package to obtain 1-minute averages for the entire data set. I then created a vector of start times and used the 'subset' function to extract the 1-minute averages I was interested in.

library(openair)
df.avg <- timeAverage(df, avg.time = "min", statistic = "mean")
cond.1.time <- c(
  '2020-03-06 10:09:00', 
  '2020-03-06 10:13:00',
  '2020-03-06 10:18:00'
) #enter start times
library(dplyr)
df.cond.1.avg <- subset(df.avg,
                           date %in% cond.1.time)  #filter data based off vector
df.cond.1.avg <- as.data.frame(df.cond.1.avg) #tibble to df

However, this approach will not work for 5-minute averages, since the time frames I am interested in do not all start at 5-minute increments from one another. My previous approach also restricted me to 1-minute averages that start at the top of the minute.

I need to be able to extract 5-minute averages scattered randomly throughout the day. These are not rolling averages. I will need to extract approximately thirty 5-minute averages per day, so being able to type in only the start time would be key.

Thank you!


Solution

  • Using the dplyr and tidyr libraries, you can filter the rows that fall inside the interval and then average each column. This is probably not the most efficient approach, but it works.

    library(dplyr)
    library(tidyr)
    data <- data.frame(date = seq(as.POSIXct("2020-02-01 01:01:01"),
                                  as.POSIXct("2020-02-01 20:01:10"),
                                  by = "sec"),
                       A = rnorm(68410),
                       B = rnorm(68410),
                       C = rnorm(68410))
    
    meanMinutes <- function(data, start, interval){
      # interval: window length in minutes
      start <- as.POSIXct(start)
      end <- start + 60*interval
      # Half-open window [start, end): 5 minutes of 1 Hz data is 300 rows, not 301
      filterData <- dplyr::filter(data, date >= start, date < end)
      # First timestamp actually present in the window
      date_start <- filterData$date[1]
      # Reshape to long format, average per channel, reshape back to wide
      meanData <- filterData %>% 
        tidyr::gather(key = "param", value = "value", A:C) %>% 
        dplyr::group_by(param) %>% 
        dplyr::summarise(value = mean(value, na.rm = TRUE)) %>% 
        tidyr::spread(key = "param", value = "value")
      return(cbind(date_start, meanData))
    }
    

    For one start time:

    meanMinutes(data, "2020-02-01 07:03:11", 5)
    

    Result:

               date_start           A           B          C
    1 2020-02-01 07:03:11 0.004083064 -0.06067075 -0.1304691
    

    For multiple start times:

    dates <- c("2020-02-01 02:53:41", "2020-02-01 05:23:14", 
               "2020-02-01 07:03:11", "2020-02-01 19:10:45")
    do.call(rbind, lapply(dates, function(x) meanMinutes(data, x, 5)))
    

    Result:

               date_start            A           B           C
    1 2020-02-01 02:53:41 -0.001929374 -0.03807152  0.06072332
    2 2020-02-01 05:23:14  0.009494321 -0.05911055 -0.02698245
    3 2020-02-01 07:03:11  0.004083064 -0.06067075 -0.13046909
    4 2020-02-01 19:10:45 -0.123574816 -0.02373881  0.05997007
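
If you would rather avoid the reshape step, the same idea can be sketched in base R. This is only a minimal sketch, not a tested replacement: the helper name `window_means`, its arguments, and the choice of UTC are assumptions made for illustration, and it assumes the timestamp column is called `date` and every other column is numeric. It uses a half-open window (start included, end excluded) and also returns the number of rows found, so incomplete windows are easy to spot.

```r
# Hypothetical helper: average every non-date column over [start, start + minutes).
window_means <- function(df, starts, minutes = 5, tz = "UTC") {
  starts <- as.POSIXct(starts, tz = tz)
  value_cols <- setdiff(names(df), "date")
  rows <- lapply(seq_along(starts), function(i) {
    s <- starts[i]
    # Logical mask for rows inside the half-open window
    in_window <- df$date >= s & df$date < s + minutes * 60
    means <- colMeans(df[in_window, value_cols, drop = FALSE], na.rm = TRUE)
    # One output row: start time, row count, then one mean per channel
    cbind(data.frame(start = s, n = sum(in_window)), as.data.frame(t(means)))
  })
  do.call(rbind, rows)
}

# Deterministic toy data (made up for this example): 15 minutes at 1 Hz
df <- data.frame(
  date = seq(as.POSIXct("2020-03-06 10:00:00", tz = "UTC"),
             by = "sec", length.out = 900),
  A = 0:899,
  B = 1000 + 0:899
)

# Only the start times are typed in; the end of each window is computed
res <- window_means(df, c("2020-03-06 10:01:00", "2020-03-06 10:07:30"))
print(res)
```

A window that contains no rows gives `n = 0` and `NaN` means, which can be filtered out afterwards.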