Tags: r, dplyr, time-series, subset

How to select rows whose time falls within intervals from a different dataframe, compute summary functions, and merge back into the original dataframe


This seems like it should have an easy solution, but after trying several things and countless searches over the past few hours I can't get it to work. Hopefully someone here can help. I am trying to subset the rows of one dataframe (dfdive) based on time intervals from another summary dataframe (dfsum).

My two dataframes:

  1. Summary dive data (dfsum)

[Image: Dive Summary]

  2. Dive Data (dfdive)

[Image: Dive Data]

What I need to do is select only the rows from the second dataframe (dfdive) that are between:

  1. dfsum$hour_before and dfsum$start
  2. dfsum$dive_end and dfsum$hour_after

I then need to compute a series of functions (sd, mean, median, min, max) on the selected rows (representing the hour before the dive and the hour after the dive) for depth and temperature from dfdive, and merge the result as a summary line back into the dfsum dataframe according to its dive number.

i.e. the desired output: [Image: Desired output]

I have tried several iterations to extract the times using dplyr as I am most familiar with it:

library(dplyr)
dfdive %>%
  group_by(Ptt) %>% 
  filter(Date >= as_datetime(dfsum$hour_before) & Date <= as_datetime(dfsum$start))

but this does not work and neither do any of my other attempts.

Here is the dput of both dataframes.

dfsum:

structure(list(start = structure(c(1436090880, 1436093580, 1436097300
), class = c("POSIXct", "POSIXt"), tzone = "UTC"), ptt = c("144883", 
"144883", "144883"), dive = 13:15, hour_before = structure(c(1436087280, 
1436089980, 1436093700), class = c("POSIXct", "POSIXt"), tzone = "UTC"), 
    dive_end = structure(c(1436092320, 1436095680, 1436098610
    ), class = c("POSIXct", "POSIXt"), tzone = "UTC"), hour_after = structure(c(1436095920, 
    1436099280, 1436102210), class = c("POSIXct", "POSIXt"), tzone = "UTC")), row.names = c(NA, 
-3L), class = c("data.table", "data.frame"))

dfdive:

 structure(list(Date = structure(c(1436090220, 1436090820, 1436091420, 
    1436092020, 1436092620, 1436093220, 1436093820, 1436094420, 1436095020, 
    1436095620, 1436096220, 1436096820, 1436097420, 1436098020, 1436098620, 
    1436099220, 1436099820, 1436100420, 1436101020, 1436101620, 1436102220, 
    1436102820, 1436103420), tzone = "UTC", class = c("POSIXct", 
    "POSIXt")), Depth = c(76.5, 3, 100, 57.5, 4, 3.5, 39, 74.5, 96.5, 
    21.5, 3.5, 4, 15, 121, 13.5, 4, 3, 7, 7, 5, 5, 4.5, 8.5), Temp = c(26.45, 
    29.35, 25.15, 27.8, 29.5, 29.45, 28.7, 26.95, 25.65, 29.1, 29.6, 
    29.65, 29.35, 23.2, 29.2, 29.8, 29.8, 29.35, 29.35, 29.5, 29.45, 
    29.95, 29.45), `Light Level` = c(147.75, 201.5, 129.25, 161.5, 
    200.75, 184.25, 164.5, 142.5, 133.75, 175.75, 186.5, 195, 175.75, 
    119.25, 179.25, 189.75, 181.5, 182, 180.5, 186.25, 180.75, 181, 
    179), Ptt = c(144883, 144883, 144883, 144883, 144883, 144883, 
    144883, 144883, 144883, 144883, 144883, 144883, 144883, 144883, 
    144883, 144883, 144883, 144883, 144883, 144883, 144883, 144883, 
    144883)), row.names = c(NA, -23L), class = c("data.table", "data.frame"
    ))

Note:

  1. I have decreased the resolution of dfdive from 10 second intervals to 10 minute intervals for ease of posting here.
  2. I have also selected the shortest consecutive dives from my dataframe (dives 13:15)
  3. Ptt is the ID number for the animal. I need to group by Ptt because I have 22 animals.

Solution

  • The newest development version of dplyr has huge improvements on joins. You can install it using:

    install.packages("devtools")
    devtools::install_github("tidyverse/dplyr")
    

    First you need to convert both ptt/Ptt columns to the same format. In your example, one is character and one is numeric.

    dfsum2 <- dfsum %>% 
        mutate(ptt = as.numeric(ptt))
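
To see why the conversion matters, here is a minimal sketch on toy stand-ins for the two key columns (the tibbles `left` and `right` are hypothetical, not the real data). It assumes a dplyr version that provides join_by(). The join on mismatched key types raises an error, which is captured here so the script keeps running, and the same join succeeds after the conversion.

```r
library(dplyr)

# Toy stand-ins for the two key columns (hypothetical, not the real data).
left  <- tibble(Ptt = c(144883, 144883))   # numeric, like dfdive$Ptt
right <- tibble(ptt = "144883")            # character, like dfsum$ptt

# Joining numeric keys to character keys raises an error, so capture it
# instead of letting it stop the script.
res <- tryCatch(
  inner_join(left, right, by = join_by(Ptt == ptt)),
  error = function(e) "join failed: incompatible key types"
)

# After converting the character key to numeric, the same join works.
fixed <- inner_join(
  left,
  mutate(right, ptt = as.numeric(ptt)),
  by = join_by(Ptt == ptt)
)
```

Checking `class(dfdive$Ptt)` and `class(dfsum$ptt)` before joining is a quick way to spot this kind of mismatch.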
    

    Now you can use the new overlap join. The join is computed twice because you have two intervals per dive (the hour before and the hour after). I assume you want to combine the two results afterwards.

    before <- dfdive %>% 
        right_join(dfsum2, by = join_by("Ptt" == "ptt", between(Date, hour_before, start)))
    after <- dfdive %>% 
        right_join(dfsum2, by = join_by("Ptt" == "ptt", between(Date, dive_end, hour_after)))
    
    Finally, you can use group_by/summarize/across to compute the statistics.

    bind_rows(before, after) %>% 
        group_by(Ptt, dive) %>% 
        summarize(across(c(start, hour_before, dive_end, hour_after), first),
                  across(.cols = c(Depth, Temp, `Light Level`), 
                       .fns = list(sd = sd, mean = mean, median = median, min = min, max = max)),
                  .groups = "drop")
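
If the hour before and the hour after each dive should be summarized separately rather than pooled, one possible variant is sketched below. It is an illustration under assumptions, not the answer's method. It assumes dplyr 1.1.0 or later (where join_by() is available) plus tidyr, rebuilds the question's data with ptt already numeric, and introduces the hypothetical names `to_utc`, `windows`, `window`, `from`, and `to`. Light Level is left out for brevity.

```r
library(dplyr)
library(tidyr)

# Hypothetical helper: seconds since the epoch -> POSIXct in UTC.
to_utc <- function(x) as.POSIXct(x, origin = "1970-01-01", tz = "UTC")

# The question's summary data, with ptt already numeric.
dfsum <- tibble(
  ptt         = 144883,
  dive        = 13:15,
  hour_before = to_utc(c(1436087280, 1436089980, 1436093700)),
  start       = to_utc(c(1436090880, 1436093580, 1436097300)),
  dive_end    = to_utc(c(1436092320, 1436095680, 1436098610)),
  hour_after  = to_utc(c(1436095920, 1436099280, 1436102210))
)

# The question's dive data (10-minute resolution, Light Level omitted).
dfdive <- tibble(
  Date  = to_utc(seq(1436090220, by = 600, length.out = 23)),
  Depth = c(76.5, 3, 100, 57.5, 4, 3.5, 39, 74.5, 96.5, 21.5, 3.5, 4,
            15, 121, 13.5, 4, 3, 7, 7, 5, 5, 4.5, 8.5),
  Temp  = c(26.45, 29.35, 25.15, 27.8, 29.5, 29.45, 28.7, 26.95, 25.65,
            29.1, 29.6, 29.65, 29.35, 23.2, 29.2, 29.8, 29.8, 29.35,
            29.35, 29.5, 29.45, 29.95, 29.45),
  Ptt   = 144883
)

# One row per dive and window, so a single overlap join covers both intervals.
windows <- bind_rows(
  transmute(dfsum, ptt, dive, window = "before", from = hour_before, to = start),
  transmute(dfsum, ptt, dive, window = "after",  from = dive_end,    to = hour_after)
)

# Match each dive record to the window(s) it falls in, then summarize per window.
stats <- dfdive %>%
  inner_join(windows, by = join_by(Ptt == ptt, between(Date, from, to))) %>%
  group_by(Ptt, dive, window) %>%
  summarize(across(c(Depth, Temp),
                   list(sd = sd, mean = mean, median = median,
                        min = min, max = max)),
            .groups = "drop") %>%
  # Spread the two windows into columns such as Depth_mean_before.
  pivot_wider(names_from = window,
              values_from = c(starts_with("Depth_"), starts_with("Temp_")))

# Attach the statistics to the summary table, one row per dive.
result <- left_join(dfsum, stats, by = join_by(ptt == Ptt, dive))
```

Because the final step is a left join from dfsum, every dive keeps its row, and a dive with no records in a window would get NA in that window's columns.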