Search code examples
rgroup-bymeanlubridatemutate

Looking for an easier way to calculate daily and hourly averages in R


I am able to solve my problem successfully using the following code in R. But I am looking if there is a possibility to do it in a simpler way. I am calculating 60min average and daily average. The preprocessed data is stored in 'merged' object. Now there are certain protocols of my task such as (i) Time should be in yyyy-mm-dd hh:mm:ss+00 format (ii) daily average should be computed using 60min average only (and not using 'merged' object).

Code:

#Cal 60min average
avghr <- merged %>%
 select(any_of(cn)) %>%          #cn includes the selected columns in the output  
 as_tibble() %>%
 group_by(Time_sp = lubridate::floor_date(`Date/Time`, "60 mins")) %>% 
 mutate(Time_sp = format(Time_sp, "%Y-%m-%d %H:%M:%S+00")) %>%
 summarise(across(where(is.numeric), ~ if(mean(is.na(.x)) > 0.5) NA else mean(.x, na.rm = TRUE)))
write.csv(avghr, paste0(dirlist[idx],"_hr.csv"), row.names = FALSE)

#60min average to be used for daily average calc
avghr1 <- merged %>%
 select(any_of(cn)) %>%
 as_tibble() %>%
 group_by(Time_sp = lubridate::floor_date(`Date/Time`, "60 mins")) %>% 
 summarise(across(where(is.numeric), ~ if(mean(is.na(.x)) > 0.5) NA else mean(.x, na.rm = TRUE)))
 
#Calc daily average
avgdl <- avghr1 %>%
 group_by(Time_sp = lubridate::floor_date(`Time_sp`, "1 day")) %>%
 mutate(Time_sp = format(Time_sp, "%Y-%m-%d %H:%M:%S+00")) %>%
 summarise(across(where(is.numeric), ~ if(sum(is.na(.x)) > 1) NA else mean(.x, na.rm = TRUE)))
write.csv(avgdl, paste0(dirlist[idx],"_dly.csv"), row.names = FALSE)

To have Time in the specified format, I have used mutate function but mutate function converts Date/Time format to string format. Thus, Time_sp column in avghr is no longer in Date/Time format, hence, is no longer valid for grouping of 1 day and further calculation of daily average. Thus, as a way out, I have calculated 60min average as avghr, used mutate to have Time in the required format and print it as .csv output. Then I again calculated 60min average as avghr1, kept Time_sp in Date/Time format as intact and, therefore successfully used avghr1 to calculate daily average avgdl. Is there any way to avoid calculating avghr1 and still achieve the same results in desired manner. I hope that I have made my problem clear.

Input data

Date/Time C1 [C1c] C2 [C2c] C3 [C3c] C4 [C3c] C5 [h]               
2021-03-01S00:00:00 267 295 250 .....
2021-03-01S00:01:00 .....
2021-03-01S00:02:00 .....
..

dput() of a sample input .txt file

structure(list(`Date/Time` = structure(c(1614556800, 1614556860, 
1614556920, 1614556980, 1614557040, 1614557100), tzone = "UTC", class = 
c("POSIXct", "POSIXt")), `XY [XY]` = c(0.990641, 0.990641, 0.990641, 
0.990641, 0.990641, 0.990641), `C1 [C1]` = c(257, 257, 257, 256, 255, 
255), Cc = c(0, 0, 0, 0, 0, 0), `C2 [C2]` = c(285, 284, 289, 264, 
231, 223), Dc = c(0, 0, 0, 0, 0, 0), `C3 [C3]` = c(255, 255, 
255, 255, 254, 254), Ac = c(0, 0, 0, 0, 0, 0), C4 = c(0.463735, 
0.465678, 0.467612, 0.469561, 0.471472, 0.473374), `C5 [h]` = c(1013, 
NA, NA, NA, NA, NA), `C6 [%]` = c(43, NA, NA, NA, NA, NA), `C7 [E2]` = 
c(390, 390, 393, 380, 365, 361), Jc = c(0, 0, 0, 0, 0, 0), `D [S]` = 
c(62.3716, 62.2459, 62.1206, 61.9942, 61.8701, 61.7465), `Sw [S2]` = 
c(1392.95, 1392.95, 1392.95, 1392.95, 1392.95, 1392.95), `SW [Q2]` = 
c(389.164, 389.253, 392.14, 379.964, 363.91, 360.562), `QA [H2]` = 
c(646.61, 649.313, 652.002, 654.712, 657.371, 660.016), `T2 [C]` = 
c(3.7, NA, NA, NA, NA, NA), Lc = c(0, 0, 0, 0, 0, 0)), row.names = c(NA, 
-6L), class = c("tbl_df", "tbl", "data.frame"))

Desired output

Hourly output
Time_sp C1 [C1c] C3 [C3c] C5 [C5c]....               
2021-03-01 01:00:00+00 257 285 255 .....
2021-03-01 02:00:00+00  .....
2021-03-01 03:00:00+00  .....
..

Daily output
Time_sp C1 [C1c] C3 [C3c] C5 [C5c]...               
2021-03-01 00:00:00+00 257 285 255 .....
2021-03-02 00:00:00+00  .....
2021-03-03 00:00:00+00  .....
..

Solution

  • Create a function for formatting and writing, let it retun input dataframe inivisibly so it would not affect further processing when used in a pipeline:

    library(dplyr)
    
    # write csv with custom datetime format, return input data.frame as-is, ivisibly
    write.csv_custom_datetime <- function(df_, filename){
      mutate(df_, Time_sp = format(Time_sp, "%Y-%m-%d %H:%M:%S+00")) %>%
        write.csv(filename, row.names = FALSE)  
      invisible(df_)
    }
    
    avghr <- merged %>%
      # select(any_of(cn)) %>%          #cn includes the selected columns in the output  
      as_tibble() %>%
      group_by(Time_sp = lubridate::floor_date(`Date/Time`, "60 mins")) %>% 
      summarise(across(where(is.numeric), ~ if(mean(is.na(.x)) > 0.5) NA else mean(.x, na.rm = TRUE))) %>% 
      # write CSV while keeping POSIXct for avghr
      write.csv_custom_datetime(paste0(dirlist[idx],"_hr.csv"))
    # CSV file content:
    # readLines(paste0(dirlist[idx],"_hr.csv")) %>% stringr::str_trunc(80) %>% paste0(collapse = "\n") %>% cat()
    #> "Time_sp","XY [XY]","C1 [C1]","Cc","C2 [C2]","Dc","C3 [C3]","Ac","C4","C5 [h]...
    #> "2021-03-01 00:00:00+00",0.990641,256.166666666667,0,262.666666666667,0,254.6...
    
    #Calc daily average
    avgdl <- avghr %>%
      group_by(Time_sp = lubridate::floor_date(`Time_sp`, "1 day")) %>%
      summarise(across(where(is.numeric), ~ if(sum(is.na(.x)) > 1) NA else mean(.x, na.rm = TRUE))) %>% 
      write.csv_custom_datetime(paste0(dirlist[idx],"_dly.csv"))
    # CSV file content:
    # readLines(paste0(dirlist[idx],"_dly.csv")) %>% stringr::str_trunc(80) %>% paste0(collapse = "\n") %>% cat()
    #> "Time_sp","XY [XY]","C1 [C1]","Cc","C2 [C2]","Dc","C3 [C3]","Ac","C4","C7 [E2...
    #> "2021-03-01 00:00:00+00",0.990641,256.166666666667,0,262.666666666667,0,254.6...
    

    Prepare reprex:

    merged  <- structure(list(`Date/Time` = structure(c(1614556800, 1614556860, 
    1614556920, 1614556980, 1614557040, 1614557100), tzone = "UTC", class = 
    c("POSIXct", "POSIXt")), `XY [XY]` = c(0.990641, 0.990641, 0.990641, 
    0.990641, 0.990641, 0.990641), `C1 [C1]` = c(257, 257, 257, 256, 255, 
    255), Cc = c(0, 0, 0, 0, 0, 0), `C2 [C2]` = c(285, 284, 289, 264, 
    231, 223), Dc = c(0, 0, 0, 0, 0, 0), `C3 [C3]` = c(255, 255, 
    255, 255, 254, 254), Ac = c(0, 0, 0, 0, 0, 0), C4 = c(0.463735, 
    0.465678, 0.467612, 0.469561, 0.471472, 0.473374), `C5 [h]` = c(1013, 
    NA, NA, NA, NA, NA), `C6 [%]` = c(43, NA, NA, NA, NA, NA), `C7 [E2]` = 
    c(390, 390, 393, 380, 365, 361), Jc = c(0, 0, 0, 0, 0, 0), `D [S]` = 
    c(62.3716, 62.2459, 62.1206, 61.9942, 61.8701, 61.7465), `Sw [S2]` = 
    c(1392.95, 1392.95, 1392.95, 1392.95, 1392.95, 1392.95), `SW [Q2]` = 
    c(389.164, 389.253, 392.14, 379.964, 363.91, 360.562), `QA [H2]` = 
    c(646.61, 649.313, 652.002, 654.712, 657.371, 660.016), `T2 [C]` = 
    c(3.7, NA, NA, NA, NA, NA), Lc = c(0, 0, 0, 0, 0, 0)), row.names = c(NA, 
    -6L), class = c("tbl_df", "tbl", "data.frame"))
    
    dirlist <- c("tmp")
    idx <- 1
    

    Created on 2023-05-31 with reprex v2.0.2