I have a dataset of GPS points and I want to remove points that fall within 2 hours of the previously kept point. Here's a sample of the dataset:
gps_data_animals_id acquisition_time
348179 348179 2015-09-18 00:00:00
348180 348180 2015-09-18 01:45:00
348181 348181 2015-09-18 02:00:00
348182 348182 2015-09-18 02:15:00
348183 348183 2015-09-18 02:30:00
348184 348184 2015-09-18 04:30:00
348185 348185 2015-09-18 04:45:00
348186 348186 2015-09-18 05:00:00
348187 348187 2015-09-18 06:00:00
348188 348188 2015-09-18 12:00:00
348189 348189 2015-09-18 17:15:00
348190 348190 2015-09-18 17:30:00
348191 348191 2015-09-18 17:45:00
348192 348192 2015-09-18 18:00:00
348193 348193 2015-09-18 18:15:00
348194 348194 2015-09-18 18:30:00
348195 348195 2015-09-18 18:45:00
348196 348196 2015-09-19 00:00:00
348197 348197 2015-09-19 06:01:00
348198 348198 2015-09-19 11:15:00
And I want locations separated in time by at least 2h, so this would be the filtered dataset:
gps_data_animals_id acquisition_time
348179 348179 2015-09-18 00:00:00
348181 348181 2015-09-18 02:00:00
348184 348184 2015-09-18 04:30:00
348188 348188 2015-09-18 12:00:00
348189 348189 2015-09-18 17:15:00
348196 348196 2015-09-19 00:00:00
348197 348197 2015-09-19 06:01:00
348198 348198 2015-09-19 11:15:00
I've been playing a bit with the lag() function, as it seems to do more or less what I need, but I end up removing more than I want. This is what I have done so far:
dataset$time_diff <- unlist(tapply(dataset$acquisition_time, INDEX = dataset$animals_id,
FUN = function(x) c(0, `units<-`(diff(x), "hours"))))
And then I would remove the rows where time_diff is less than 2h, but that removes more than I want: it would also remove, e.g., gps_data_animals_id = 348181, which I want to keep because it is 2h after the first location.
What I think could work: sequentially take the first two rows, calculate the time difference, and remove the second row if the difference is less than 2h. Then pair the last kept row with the next row and repeat the process. But I'm not sure how to do that, code-wise.
Any thoughts?
Here's the reproducible example of the dataset:
structure(list(gps_data_animals_id = 348179:348198, acquisition_time = structure(c(1442534400,
1442540700, 1442541600, 1442542500, 1442543400, 1442550600, 1442551500,
1442552400, 1442556000, 1442577600, 1442596500, 1442597400, 1442598300,
1442599200, 1442600100, 1442601000, 1442601900, 1442620800, 1442642460,
1442661300), class = c("POSIXct", "POSIXt"), tzone = "GMT")), row.names = 348179:348198, class = "data.frame")
library(dplyr)
library(purrr)
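# df1 is the data frame from the question's dput() output
df1 %>%
  # Accumulate minutes elapsed since the last kept row. Once the running total
  # reaches 120 the row is kept and the count restarts from the next gap.
  # (Resetting only after a kept row avoids losing elapsed time when a running
  # total below 120 is pushed past 120 by a gap that is itself below 120.)
  filter(accumulate(c(120, as.double(diff(acquisition_time), units = "mins")),
                    ~ if (.x >= 120) .y else .x + .y) >= 120)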
#> gps_data_animals_id acquisition_time
#> 1 348179 2015-09-18 00:00:00
#> 2 348181 2015-09-18 02:00:00
#> 3 348184 2015-09-18 04:30:00
#> 4 348188 2015-09-18 12:00:00
#> 5 348189 2015-09-18 17:15:00
#> 6 348196 2015-09-19 00:00:00
#> 7 348197 2015-09-19 06:01:00
#> 8 348198 2015-09-19 11:15:00
Created on 2024-07-11 with reprex v2.0.2
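For comparison, the procedure described in the question (compare each row with the last kept one, keep it only if at least 2h have passed) can also be written as a plain loop in base R. This is only a minimal sketch: `keep_spaced` and `min_gap_mins` are hypothetical names made up for illustration, and it assumes the timestamps are sorted in ascending order and belong to a single animal.

```r
# Hypothetical helper (not part of any package): returns a logical vector
# marking the rows to keep. Assumes `times` is POSIXct, sorted ascending.
keep_spaced <- function(times, min_gap_mins = 120) {
  keep <- logical(length(times))
  if (length(times) == 0) return(keep)
  keep[1] <- TRUE                 # the first point is always kept
  last_kept <- times[1]
  for (i in seq_along(times)[-1]) {
    # minutes elapsed since the last point that was kept
    gap <- as.numeric(difftime(times[i], last_kept, units = "mins"))
    if (gap >= min_gap_mins) {
      keep[i] <- TRUE
      last_kept <- times[i]
    }
  }
  keep
}

# Sample data from the question, rebuilt from its dput() output
df1 <- data.frame(
  gps_data_animals_id = 348179:348198,
  acquisition_time = as.POSIXct(
    c(1442534400, 1442540700, 1442541600, 1442542500, 1442543400,
      1442550600, 1442551500, 1442552400, 1442556000, 1442577600,
      1442596500, 1442597400, 1442598300, 1442599200, 1442600100,
      1442601000, 1442601900, 1442620800, 1442642460, 1442661300),
    origin = "1970-01-01", tz = "GMT"
  )
)

result <- df1[keep_spaced(df1$acquisition_time), ]
result
```

On the sample data this keeps the same eight rows as listed in the question. If the full dataset holds several animals, the helper could probably be applied per animal, for example with dplyr's group_by() on an ID column followed by filter(keep_spaced(acquisition_time)); that such a column exists is an assumption based on the animals_id used in the question's code.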