This seems like it should have an easy solution, but after trying several things and countless searches over the past few hours I can't seem to get it. Hopefully someone here can help. I am trying to subset the rows of one dataframe (dfdive) based on a time interval from another summary dataframe (dfsum).
My two dataframes are dfsum and dfdive (their dput output is given below).
What I need to do is select only the rows from the second dataframe (dfdive) that are between:
dfsum$hour_before and dfsum$start
dfsum$dive_end and dfsum$hour_after
And then perform a series of functions (sd, mean, median, min, max) on the selected rows (covering the hour before the dive and the hour after the dive) for depth and temperature from dfdive, and merge the result as a summary line back into the dfsum dataframe according to its dive number.
I have tried several iterations to extract the times using dplyr, as I am most familiar with it:
library(dplyr)
library(lubridate)  # as_datetime() comes from lubridate, not dplyr
dfdive %>%
  group_by(Ptt) %>%
  filter(Date >= as_datetime(dfsum$hour_before) & Date <= as_datetime(dfsum$start))
but this does not work and neither do any of my other attempts.
Here is the dput of both dataframes.
dfsum:
structure(list(start = structure(c(1436090880, 1436093580, 1436097300
), class = c("POSIXct", "POSIXt"), tzone = "UTC"), ptt = c("144883",
"144883", "144883"), dive = 13:15, hour_before = structure(c(1436087280,
1436089980, 1436093700), class = c("POSIXct", "POSIXt"), tzone = "UTC"),
dive_end = structure(c(1436092320, 1436095680, 1436098610
), class = c("POSIXct", "POSIXt"), tzone = "UTC"), hour_after = structure(c(1436095920,
1436099280, 1436102210), class = c("POSIXct", "POSIXt"), tzone = "UTC")), row.names = c(NA,
-3L), class = c("data.table", "data.frame"))
dfdive:
structure(list(Date = structure(c(1436090220, 1436090820, 1436091420,
1436092020, 1436092620, 1436093220, 1436093820, 1436094420, 1436095020,
1436095620, 1436096220, 1436096820, 1436097420, 1436098020, 1436098620,
1436099220, 1436099820, 1436100420, 1436101020, 1436101620, 1436102220,
1436102820, 1436103420), tzone = "UTC", class = c("POSIXct",
"POSIXt")), Depth = c(76.5, 3, 100, 57.5, 4, 3.5, 39, 74.5, 96.5,
21.5, 3.5, 4, 15, 121, 13.5, 4, 3, 7, 7, 5, 5, 4.5, 8.5), Temp = c(26.45,
29.35, 25.15, 27.8, 29.5, 29.45, 28.7, 26.95, 25.65, 29.1, 29.6,
29.65, 29.35, 23.2, 29.2, 29.8, 29.8, 29.35, 29.35, 29.5, 29.45,
29.95, 29.45), `Light Level` = c(147.75, 201.5, 129.25, 161.5,
200.75, 184.25, 164.5, 142.5, 133.75, 175.75, 186.5, 195, 175.75,
119.25, 179.25, 189.75, 181.5, 182, 180.5, 186.25, 180.75, 181,
179), Ptt = c(144883, 144883, 144883, 144883, 144883, 144883,
144883, 144883, 144883, 144883, 144883, 144883, 144883, 144883,
144883, 144883, 144883, 144883, 144883, 144883, 144883, 144883,
144883)), row.names = c(NA, -23L), class = c("data.table", "data.frame"))
Note: dfdive has been reduced from 10 second intervals to 10 minute intervals for ease of posting here.

The newest development version of dplyr has huge improvements on joins (the join_by() helpers used below were later released on CRAN in dplyr 1.1.0). You can install it using:
install.packages("devtools")
devtools::install_github("tidyverse/dplyr")
First, you need to convert both ptt/Ptt columns to the same type. In your example, one is character and one is numeric.
dfsum2 <- dfsum %>%
  mutate(ptt = as.numeric(ptt))
Now you can use the new overlap join. We compute it twice because you have two intervals; I imagine you want to combine the two results together.
before <- dfdive %>%
  right_join(dfsum2, by = join_by("Ptt" == "ptt", between(Date, hour_before, start)))
after <- dfdive %>%
  right_join(dfsum2, by = join_by("Ptt" == "ptt", between(Date, dive_end, hour_after)))
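To see what the overlap condition does in isolation, here is a minimal sketch on toy data. The table names, column names and values (readings, windows, t, lower, upper) are made up for illustration and are not from the question. It assumes dplyr 1.1.0 or later, and it uses inner_join rather than right_join so that only matching rows are shown.

```r
library(dplyr)  # join_by() and its between() helper need dplyr >= 1.1.0

# Hypothetical toy data: four readings and two time windows for one tag
readings <- tibble(id = 1, t = c(5, 15, 25, 35), depth = c(10, 20, 30, 40))
windows  <- tibble(id = 1, dive = c(1L, 2L), lower = c(0, 20), upper = c(10, 30))

# Keep each reading whose t falls inside [lower, upper] of a window with the same id.
# Bounds are inclusive by default.
matched <- readings %>%
  inner_join(windows, by = join_by(id, between(t, lower, upper)))

print(matched)
```

Only the readings at t = 5 and t = 25 fall inside a window, so they are the only rows that survive the join, each carrying the dive number of the window it matched.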
Finally, you can use group_by/summarize/across to compute the statistics.
bind_rows(before, after) %>%
  group_by(Ptt, dive) %>%
  summarize(across(c(start, hour_before, dive_end, hour_after), first),
            across(.cols = c(Depth, Temp, `Light Level`),
                   .fns = c(sd = sd, mean = mean, median = median, min = min, max = max)),
            .groups = "drop")
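If the hour-before and hour-after statistics should stay separate instead of being pooled per dive, one possible variation is to label each join result before binding and then group by that label as well. The sketch below is only an illustration under assumptions: the data are hypothetical toy values that reuse the question's column names, the window column is a name introduced here, only two statistics are computed to keep it short, and inner_join is swapped in for right_join (so dives with no readings in a window are dropped from the statistics).

```r
library(dplyr)  # requires dplyr >= 1.1.0 for join_by()

# Hypothetical toy data reusing the question's column names
t0 <- as.POSIXct("2015-07-05 09:00:00", tz = "UTC")
dfsum2 <- tibble(
  ptt = 144883, dive = 13L,
  hour_before = t0,
  start       = t0 + 60 * 60,   # 10:00
  dive_end    = t0 + 60 * 90,   # 10:30
  hour_after  = t0 + 60 * 150   # 11:30
)
dfdive <- tibble(
  Ptt   = 144883,
  Date  = t0 + 60 * c(10, 40, 70, 100, 130),  # 09:10, 09:40, 10:10, 10:40, 11:10
  Depth = c(10, 20, 99, 30, 50)
)

# Tag each set of matched rows with the window it came from
before <- dfdive %>%
  inner_join(dfsum2, by = join_by(Ptt == ptt, between(Date, hour_before, start))) %>%
  mutate(window = "before")
after <- dfdive %>%
  inner_join(dfsum2, by = join_by(Ptt == ptt, between(Date, dive_end, hour_after))) %>%
  mutate(window = "after")

# One row of statistics per tag, dive and window
stats <- bind_rows(before, after) %>%
  group_by(Ptt, dive, window) %>%
  summarize(across(Depth, list(mean = mean, max = max)), .groups = "drop")

# Attach the statistics back to the summary table by tag and dive number
result <- dfsum2 %>%
  left_join(stats, by = join_by(ptt == Ptt, dive))

print(result)
```

The reading at 10:10 falls during the dive itself, so it belongs to neither window and does not affect either set of statistics.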