Search code examples
rdataframe

How to round datetime column values to hour by groups of differing lengths?


Example data

My dataframe is from a high-frequency buoy taking measurements at discrete depths in a lake.

library(rLakeAnalyzer)
library(tidyverse)
library(plyr)
PFL_counter = c(2,2,2,3,3,3,4,4)
depth = c(0.5,1.0,1.5,0.5,1.0,1.5,0.5,1.0)
temp_C = c(14.27,14.22,14.20,14.23,14.23,14.22,14.23,14.22)
datetime = c("5/11/23 17:01","5/11/23 17:02","5/11/23 17:04",
             "5/11/23 18:01","5/11/23 18:22","5/11/23 19:14",
             "5/11/23 19:14","5/11/23 19:16")

Carmi_23 = data.frame(PFL_counter, depth ,temp_C, datetime)
Carmi_23$datetime = mdy_hm(Carmi_23$datetime)

Carmi_23

The actual data have 58522 observations, depth increases from 0.5 to 9 MAX (e.g. sometimes will only go to 7) by 0.5 increments, the counter increases by 1 for every depth profile completed (every time depth goes back to 0.5).

The goal

There should be 1 depth profile for each hour. My goal is to have each datetime rounded to the hour for its depth profile (you could think of this as its PFL_counter group). Like so:

PFL_counter = c(2,2,2,3,3,3,4,4)
depth = c(0.5,1.0,1.5,0.5,1.0,1.5,0.5,1.0)
temp_C = c(14.27,14.22,14.20,14.23,14.23,14.22,14.23,14.22)
datetime = c("5/11/23 17:00","5/11/23 17:00","5/11/23 17:00",
             "5/11/23 18:00","5/11/23 18:00","5/11/23 18:00",
             "5/11/23 19:00","5/11/23 19:00")

Carmi_23 = data.frame(PFL_counter, depth ,temp_C, datetime)
Carmi_23$datetime = mdy_hm(Carmi_23$datetime)

Carmi_23

The issue and my attempts

The issue is that for some depth profiles it took over an hour, so the last couple of depths are not within the same hour as when the profile began (at 0.5 depth). Recreated issue in first code chunk, above, in row 6.

My first approach:

Carmi_23$datetime = floor_date(Carmi_23$datetime, "hour") # tried this

Carmi_23$datetime = round_date(Carmi_23$datetime, "10 mins") # also this
Carmi_23$datetime = floor_date(Carmi_23$datetime, "hour")

This becomes an issue when I need to pivot_wider because then I get half a row for one hour and half a row for another (and I can't have NAs for rLakeAnalyzer heat map). In the example data:

testingpivot = Carmi_23 %>%
  select(datetime=datetime, depth=depth, temp_C=temp_C) %>%
  pivot_wider(id_cols=datetime,
              names_from=depth,
              values_from=temp_C,
              values_fn=function(x) mean(x, na.rm=TRUE)
  ) %>%
  rename_with(~ paste0("wtr_", .), -datetime)
#%>% na.omit() leave until figured out
testingpivot

I also looked into openair

library(openair)
Carmi_23$date = Carmi_23$datetime
Carmi_openair = timeAverage(Carmi_23, avg.time = "hour", type = "PFL_counter") # datetime needs to be called "date"

But this averages all the column values so that the time is collapsed into one row per avg hour...

My next attempts are trying to group_by and mean and summarize (take the average by PFL_counter, and then round down), but the key is that I need the datetime to repeat for all depths within its respective depth profile AND the length of depths per PFL_counter is unpredictable (in example data I replicated this by having 3 rows for two depth profiles and 2 rows for one).

Does anyone have any ideas? I could just NA omit rows after pivoting for which the hour got split in half, but a very stubborn part of me wants to figure it out.


Solution

  • Base R solution:

    Carmi_23$datetime <- ave(
      Carmi_23$datetime,
      Carmi_23$PFL_counter,
      FUN = function (x)
        min(lubridate::floor_date(x, "hour"))
    )
    

    dplyr solution:

    library(dplyr)
    
    Carmi_23 |>
      mutate(datetime = min(lubridate::floor_date(datetime, "hour")), .by = PFL_counter) 
    

    Result:

    > Carmi_23
      PFL_counter depth temp_C            datetime
    1           2   0.5  14.27 2023-05-11 17:00:00
    2           2   1.0  14.22 2023-05-11 17:00:00
    3           2   1.5  14.20 2023-05-11 17:00:00
    4           3   0.5  14.23 2023-05-11 18:00:00
    5           3   1.0  14.23 2023-05-11 18:00:00
    6           3   1.5  14.22 2023-05-11 18:00:00
    7           4   0.5  14.23 2023-05-11 19:00:00
    8           4   1.0  14.22 2023-05-11 19:00:00
    

    Data:

    > dput(Carmi_23)
    structure(list(PFL_counter = c(2, 2, 2, 3, 3, 3, 4, 4), depth = c(0.5, 
    1, 1.5, 0.5, 1, 1.5, 0.5, 1), temp_C = c(14.27, 14.22, 14.2, 
    14.23, 14.23, 14.22, 14.23, 14.22), datetime = structure(c(1683824460, 
    1683824520, 1683824640, 1683828060, 1683829320, 1683832440, 1683832440, 
    1683832560), class = c("POSIXct", "POSIXt"), tzone = "UTC")), class = "data.frame", row.names = c(NA, 
    -8L))