Search code examples
rtime-seriesaveragexts

Averaging data at specific hours across days in R


  • Update(2021/5/13, 11:55AM): The data is not reproducible. Thank you @G. Grothendieck for giving me a comment.

(This is my first post in Stackoverflow, so please let me know if there's any basic mistakes I made in this question)

Hi,

I have a hourly-measured data of air quality over multiple days in R, and I would like to calculate the average air quality at specific time periods.

Here's a subset of my reproducible data. It is in xts format.

# Make a structure of data
dput(Air_sample[1:6,1:1])

# Create a data from the structure above.
Air <- structure(
 c(2.6, 2, 2.2, 2.2, 1.6, 1.2),
 class = c("xts", "zoo"),
 index = structure(
  c(
   1078959600,
   1078963200,
   1079046000,
   1079049600,
   1079132400,
   1079136000
  ),
  tzone = "",
  tclass = c("POSIXct",
             "POSIXt")
 ),
 .Dim = c(6L, 1L),
 .Dimnames = list(NULL, c("True.CO")))

> Air
                    True.CO
2004-03-10 18:00:00     2.6
2004-03-10 19:00:00     2.0
2004-03-11 18:00:00     2.2
2004-03-11 19:00:00     2.2
2004-03-12 18:00:00     1.6
2004-03-12 19:00:00     1.2

I want to calculate average CO at specific time (ex. 6PM) from multiple days. So the result would be something like below.

Air_average <- data.frame("Time" = c("18:00","19:00"), "Average CO" = c(2.1333,1.8))

> Air_average
   Time Average.CO
1 18:00     2.1333
2 19:00     1.8000

I tried different function via Googling such as "period.apply", "subset", "window", etc. But none of them seems to work.

Is there any way to do this?

Thank you.


Solution

  • You can use something like dplyr to do grouping operation, and lubridate to deal with dates. lubridate has the hour function which return only the hours. I first convert your data into a data frame:

    library(lubridate)
    library(dplyr)
    library(xts)
    
    Air <- data.frame(Air) %>% 
      add_rownames(var = "time")
    
      time                True.CO
      <chr>                 <dbl>
    1 2004-03-11 00:00:00     2.6
    2 2004-03-11 01:00:00     2  
    3 2004-03-12 00:00:00     2.2
    4 2004-03-12 01:00:00     2.2
    5 2004-03-13 00:00:00     1.6
    6 2004-03-13 01:00:00     1.2
    

    Because of my timezone, the hours are not the same than yours, but the code will be the same.

    Air %>%
      group_by(hour(time))%>%
      summarise(mean(True.CO))
    
    # A tibble: 2 x 2
      `hour(time)` `mean(True.CO)`
             <int>           <dbl>
    1            0            2.13
    2            1            1.8