What I am trying to do seems simple- but after 2 days of searching I have decided to post my first question here to see if anyone can help.
I have a dataframe(df) of 5 variables and 250,000 rows. Sample:
date.time Lat Lon Depth ms
1: 2015-11-23 01:14:00 -3.230916 135.0655 100.5 0.391
2: 2015-11-23 03:05:00 -3.231362 135.0650 300.5 0.225
3: 2015-11-23 03:22:00 -3.231431 135.0649 500.5 0.091
4: 2015-11-23 10:51:00 -3.233221 135.0632 400.5 0.0916
5: 2015-11-23 10:52:00 -3.233225 135.0632 300.5 0.0333
6: 2015-11-23 11:32:00 -3.233383 135.0630 100.5 0.3833
7: 2015-11-23 11:33:00 -3.233387 135.0630 200.0 -0.0750
8: 2015-11-23 12:14:00 -3.233549 135.0629 220.0 0.3166
9: 2015-11-23 12:15:00 -3.233553 135.0629 300.5 0.0083
10: 2015-11-23 12:39:00 -3.233647 135.0628 500.5 0.3000
11: 2016-10-15 00:37:30 -3.349524 135.0997 550.5 -0.0083
12: 2016-10-15 00:38:30 -3.349537 135.0997 600.0 -0.0583
13: 2016-10-15 00:39:30 -3.349550 135.0998 400.5 0.0583
14: 2016-10-15 00:39:30 -3.349550 135.0998 400.5 0.0583
15: 2016-10-15 00:39:30 -3.349550 135.0998 600.5 0.0583
I want to select the previous n rows (determined by whether it is in increasing order; i.e 100, 200, 300, 400, 500, 600 NOT 100, 200, 400, 100, 50) before Depth values > 500m of the largest value above 500m (to avoid repeats of the same data). I want each of those rows in their entirety to appear in a new dataframe (newdf) :
date.time Lat Lon Depth ms
1: 2015-11-23 01:14:00 -3.230916 135.0655 100.5 0.391
2: 2015-11-23 03:05:00 -3.231362 135.0650 300.5 0.225
**3: 2015-11-23 03:22:00 -3.231431 135.0649 500.5 0.091**
6: 2015-11-23 11:32:00 -3.233383 135.0630 100.5 0.3833
7: 2015-11-23 11:33:00 -3.233387 135.0630 200.0 -0.0750
8: 2015-11-23 12:14:00 -3.233549 135.0629 220.0 0.3166
9: 2015-11-23 12:15:00 -3.233553 135.0629 300.5 0.0083
10: 2015-11-23 12:39:00 -3.233647 135.0628 500.5 0.3000
11: 2016-10-15 00:37:30 -3.349524 135.0997 550.5 -0.0083
**12: 2016-10-15 00:38:30 -3.349537 135.0997 600.0 -0.0583**
14: 2016-10-15 00:39:30 -3.349550 135.0998 400.5 0.0583
**15: 2016-10-15 00:39:30 -3.349550 135.0998 600.5 0.0583**
I have tried the following code:
which_max <- which(df$Depth >= 500)
encoding <- rle(diff(df$Depth) > 0)
# these contain the start/end indices of all continuously increasing/decreasing subsets
ends <- cumsum(encoding$lengths) + 1L
starts <- ends - encoding$lengths
# filter out the decreasing subsets
starts <- starts[encoding$values]
ends <- ends[encoding$values]
# find the one that contains the maximum
interval <- which(starts <= which_max & ends >= which_max)
out <- df[starts[interval]:ends[interval],] #picks only selected interval to print
based on a previous stack post (Subset only continuously increasing values to max value), but can only get one set of the highest values from my dataset printed instead of each from the original (df):
date.time Lat Lon Depth ms
1: 2016-05-11 23:44:30 1.769763 136.6246 102.0 0.600
2: 2016-05-11 23:53:30 1.773071 136.6247 108.0 0.7250
3: 2016-05-11 23:54:30 1.773439 136.6247 193.0 1.4166
4: 2016-05-11 23:55:30 1.773806 136.6248 281.5 1.475
5: 2016-05-11 23:56:30 1.774174 136.6248 364.5 1.383
6: 2016-05-11 23:57:30 1.774542 136.6248 447.0 1.3750
7: 2016-05-11 23:58:30 1.774910 136.6248 528.0 1.350
8: 2016-05-11 23:59:30 1.775278 136.6248 609.5 1.358
9: 2016-05-12 00:00:30 1.775646 136.6248 690.0 1.3416
10: 2016-05-12 00:01:30 1.776013 136.6249 770.0 1.33333
I am assuming I need to employ some type of loop(?) but am relatively new to coding and unsure how to go about it.
EDIT: I have also tried using "lag" but it does not solve needing multiple increasing rows or not double back counting for multiple rows of >500 m (ie. 500, 550, 600, 700...)
I have also used:
df$selecteddepth <- df$Depth * (c(0, diff(df$Depth)) >= 10)
Which selects depths greater than the difference of 10 (which means they are always increasing) but does not address selecting depths of over 500m or removing duplicates
Here is a subset using dput()
structure(list(date.time = structure(c(1450574990, 1450575050,
1450575110, 1450575170, 1450575230, 1450575290, 1450575350, 1450575410,
1450575470, 1450575530, 1450575590, 1450575650, 1450575710, 1450575770,
1450575830, 1450575890), class = c("POSIXct", "POSIXt"), tzone = "UTC"),
Lat = c(-3.24669178745284, -3.24667124000555, -3.24665068714376,
-3.24663012886971, -3.24660956518562, -3.24658899609375,
-3.24656842159633, -3.24654784169558, -3.24652725639375,
-3.24650666569307, -3.24648606959577, -3.24646546810409,
-3.24644486122025, -3.24642424894649, -3.24640363128504,
-3.24638300823813), Lon = c(135.085169407522, 135.085165930176,
135.085162450626, 135.085158968873, 135.085155484919, 135.085151998764,
135.085148510411, 135.085145019861, 135.085141527116, 135.085138032177,
135.085134535045, 135.085131035722, 135.08512753421, 135.08512403051,
135.085120524624, 135.085117016552), Depth = c(373, 453,
500, 515.5, 521, 526.5, 512, 517.5, 522.5, 504, 522.5, 508.5,
481.5, 480, 474, 453), ms = c(1.60833333333333, 1.33333333333333,
0.783333333333333, 0.258333333333333, 0.0916666666666667,
0.0916666666666667, -0.241666666666667, 0.0916666666666667,
0.0833333333333333, -0.308333333333333, 0.308333333333333,
-0.233333333333333, -0.45, -0.025, -0.1, -0.35)), row.names = c(NA,
-16L), class = c("data.table", "data.frame"))
EDIT20-8-21 for det
Current output:
As you can see, the output is in descending depth(970 > 929.5 > 888> 851.5... where the ms are positive) I need the same idea shown above, but in ascending order so it would look something like this (made up data: 500 > 545 > 600 > 700) and the ms should be negative (most of the time) as the animal is diving (negative velocity). So I need the top number that is returned in depth to be smaller than the following numbers. I hope this clarifies it!
This needs to be done iteratively because value is changing. Idea is to find first position on which is greater value and then according to that position find rows that will be in that group and update needed parameters.
cur_start <- 1
cur_value <- 500L
x <- df$Depth
l <- list()
i <- 1
repeat{
if(cur_start > length(x)) break
first_greater <- which(x[cur_start:length(x)] > cur_value)[1]
if(is.na(first_greater)){
break
} else if(first_greater == 1){
cur_start <- cur_start + 1
next
}
pos_greater <- cur_start - 1 + first_greater
cur_value <- x[[pos_greater]]
res <- diff(x[pos_greater:cur_start]) < 0
if(all(res)){
l[[i]] <- cur_start:(pos_greater - 1)
} else {
l[[i]] <- rev(pos_greater - seq_len(which.min(res) - 1))
}
cur_start <- pos_greater + 1
i <- i + 1
}
lapply(l, function(x) df[x,])