Search code examples
rdataframesubset

Removing the last 2 rows of each factor level in r


I have a dataframe, where a factor has different amounts of rows. I would like to remove the last 2 rows of each factor level. test<-data.frame(id=c(1,1,1,1,1,2,2,2,2,2,2,3,3,3,3,3,3,3), val=c(3,5,4,6,7,4,1,6,7,8,4,2,0,3,6,8,1,2), trt=c(1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2))

the outcome should look like this:

   id val trt
1   1   3   1
2   1   5   1
3   1   4   1
4   2   4   1
5   2   1   1
6   2   6   1
7   2   7   1
8   3   2   2
9   3   0   2
10  3   3   2
11  3   6   2
12  3   8   2

An added thing I need is to keep various other columns in the dataframe as well. I edited the example df to include another factor, which is not relevant for this exercise, but I do need it (and a bunch more) for future calculations.


Solution

  • UPDATE 1/23/2024

    A more performant data.table option that does not rearrange the original data:

    library(data.table)
    
    setDT(test)[-test[,.I[(.N - 1L):.N], id][[2]]][]
    #>     id val trt
    #>  1:  1   3   1
    #>  2:  1   5   1
    #>  3:  1   4   1
    #>  4:  2   4   1
    #>  5:  2   1   1
    #>  6:  2   6   1
    #>  7:  2   7   1
    #>  8:  3   2   2
    #>  9:  3   0   2
    #> 10:  3   3   2
    #> 11:  3   6   2
    #> 12:  3   8   2
    

    Benchmarking against my original answer using a larger dataset:

    microbenchmark::microbenchmark(
      SD = test[,.SD[1:(.N - 2)], id],
      I = test[-test[,.I[(.N - 1L):.N], id][[2]]],
      unit = "relative"
    )
    #> Unit: relative
    #>  expr     min       lq     mean  median       uq      max neval
    #>    SD 66.9625 65.77887 56.49757 62.2506 57.49157 17.64804   100
    #>     I  1.0000  1.00000  1.00000  1.0000  1.00000  1.00000   100
    

    Check that the results are equivalent:

    identical(
      setorder(test[,.SD[1:(.N - 2)], id], id),
      setorder(test[-test[,.I[(.N - 1L):.N], id][[2]]], id)
    )
    #> [1] TRUE
    

    Original Answer

    With data.table:

    library(data.table)
    
    setDT(test)[,.(val = val[1:(.N - 2)]), id][]
    #>     id val
    #>  1:  1   3
    #>  2:  1   5
    #>  3:  1   4
    #>  4:  2   4
    #>  5:  2   1
    #>  6:  2   6
    #>  7:  2   7
    #>  8:  3   2
    #>  9:  3   0
    #> 10:  3   3
    #> 11:  3   6
    #> 12:  3   8
    

    Or, using .SD if you have additional columns:

    setDT(test)[,.SD[1:(.N - 2)], id][]
    #>     id val trt
    #>  1:  1   3   1
    #>  2:  1   5   1
    #>  3:  1   4   1
    #>  4:  2   4   1
    #>  5:  2   1   1
    #>  6:  2   6   1
    #>  7:  2   7   1
    #>  8:  3   2   2
    #>  9:  3   0   2
    #> 10:  3   3   2
    #> 11:  3   6   2
    #> 12:  3   8   2
    

    Note that this will rearrange the data relative to the original if it is not already grouped by id.

    Base R (if test is sorted by id):

    test[sequence((n <- rle(test$id)[[1]]) - 2, c(1, cumsum(n[-length(n)]) + 1)),]
    #>     id val trt
    #>  1:  1   3   1
    #>  2:  1   5   1
    #>  3:  1   4   1
    #>  4:  2   4   1
    #>  5:  2   1   1
    #>  6:  2   6   1
    #>  7:  2   7   1
    #>  8:  3   2   2
    #>  9:  3   0   2
    #> 10:  3   3   2
    #> 11:  3   6   2
    #> 12:  3   8   2