Search code examples
runiquedata.table

How do I extract the unique rows from a subset of columns in a data table?


I would like to take the unique rows from a data.table, given a subset of columns and a condition in i. What is the best way of going about it? ("Best" in terms of computing speed and short or readable syntax)

set.seed(1)
jk <- data.table(c1 = sample(letters,60,replace = TRUE), 
                 c2 = sample(c(TRUE,FALSE),60, replace = TRUE), 
                 c3 = sample(letters,60, replace = TRUE),
                 c4 = sample.int(10,60, replace = TRUE)
                 )

Say I'd like to find the unique combinations of c1 and c2 where c4 is 10. I can think of a couple of ways to do it but am not sure what is optimal. Whether the columns to extract are keyed or not may also be important.

## works but gives an extra column
jk[c4 >= 10, TRUE, keyby = list(c1,c2)]
## this removes extra column
jk[c4 >= 10, TRUE, keyby = list(c1,c2)][,V1 := NULL]

## this seems like it could work
## but no j-expression with a keyby throws an error
jk[c4 >= 10, , keyby = list(c1,c2)]

## using unique with .SD
jk[c4 >= 10, unique(.SD), .SDcols = c("c1","c2")]

Solution

  • The most straightforward, to me at least, would be either unique(jk[c4 >= 10, list(c1, c2)]) as suggested by @Justin, or unique(jk[c4 >= 10, c("c1", "c2")]). The latter of these is the quickest of the four suggestions so far, at least on my laptop:

    microbenchmark(
    a=jk[c4 >= 10, list(c1,c2), keyby = list(c1,c2)][,c("c1","c2")],
    b=jk[c4 >= 10, unique(.SD), .SDcols = c("c1","c2")],
    c=unique(jk[c4>=10,list(c1,c2)]),
    d=unique(jk[c4>=10,c("c1","c2")])
    )
    
    Unit: microseconds
     expr      min       lq    median        uq      max neval
        a 1378.742 1456.676 1494.9380 1531.1395 2515.796   100
        b  906.404  943.072  963.7790  997.4930 3805.846   100
        c 1167.125 1201.988 1232.3500 1272.2250 2077.047   100
        d  627.768  653.314  669.8625  683.8045  739.808   100