I have a dataframe, where a factor has different amounts of rows. I would like to remove the last 2 rows of each factor level.
test<-data.frame(id=c(1,1,1,1,1,2,2,2,2,2,2,3,3,3,3,3,3,3), val=c(3,5,4,6,7,4,1,6,7,8,4,2,0,3,6,8,1,2), trt=c(1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2))
the outcome should look like this:
id val trt
1 1 3 1
2 1 5 1
3 1 4 1
4 2 4 1
5 2 1 1
6 2 6 1
7 2 7 1
8 3 2 2
9 3 0 2
10 3 3 2
11 3 6 2
12 3 8 2
An added thing I need is to keep various other columns in the dataframe as well. I edited the example df to include another factor, which is not relevant for this exercise, but I do need it (and a bunch more) for future calculations.
A more performant data.table
option that does not rearrange the original data:
library(data.table)
setDT(test)[-test[,.I[(.N - 1L):.N], id][[2]]][]
#> id val trt
#> 1: 1 3 1
#> 2: 1 5 1
#> 3: 1 4 1
#> 4: 2 4 1
#> 5: 2 1 1
#> 6: 2 6 1
#> 7: 2 7 1
#> 8: 3 2 2
#> 9: 3 0 2
#> 10: 3 3 2
#> 11: 3 6 2
#> 12: 3 8 2
Benchmarking against my original answer using a larger dataset:
microbenchmark::microbenchmark(
SD = test[,.SD[1:(.N - 2)], id],
I = test[-test[,.I[(.N - 1L):.N], id][[2]]],
unit = "relative"
)
#> Unit: relative
#> expr min lq mean median uq max neval
#> SD 66.9625 65.77887 56.49757 62.2506 57.49157 17.64804 100
#> I 1.0000 1.00000 1.00000 1.0000 1.00000 1.00000 100
Check that the results are equivalent:
identical(
setorder(test[,.SD[1:(.N - 2)], id], id),
setorder(test[-test[,.I[(.N - 1L):.N], id][[2]]], id)
)
#> [1] TRUE
With data.table
:
library(data.table)
setDT(test)[,.(val = val[1:(.N - 2)]), id][]
#> id val
#> 1: 1 3
#> 2: 1 5
#> 3: 1 4
#> 4: 2 4
#> 5: 2 1
#> 6: 2 6
#> 7: 2 7
#> 8: 3 2
#> 9: 3 0
#> 10: 3 3
#> 11: 3 6
#> 12: 3 8
Or, using .SD
if you have additional columns:
setDT(test)[,.SD[1:(.N - 2)], id][]
#> id val trt
#> 1: 1 3 1
#> 2: 1 5 1
#> 3: 1 4 1
#> 4: 2 4 1
#> 5: 2 1 1
#> 6: 2 6 1
#> 7: 2 7 1
#> 8: 3 2 2
#> 9: 3 0 2
#> 10: 3 3 2
#> 11: 3 6 2
#> 12: 3 8 2
Note that this will rearrange the data relative to the original if it is not already grouped by id
.
Base R (if test
is sorted by id
):
test[sequence((n <- rle(test$id)[[1]]) - 2, c(1, cumsum(n[-length(n)]) + 1)),]
#> id val trt
#> 1: 1 3 1
#> 2: 1 5 1
#> 3: 1 4 1
#> 4: 2 4 1
#> 5: 2 1 1
#> 6: 2 6 1
#> 7: 2 7 1
#> 8: 3 2 2
#> 9: 3 0 2
#> 10: 3 3 2
#> 11: 3 6 2
#> 12: 3 8 2