Search code examples
rdata.tablematchunique

Use unique in data.table but keeping track of original rows (e.g. unique and then match)


My first question to this wonderful community:

I want to know if there is a more efficient way of doing this but with a data.table:

# vector example
all <- c(4,4,2,1,2)
(uniq <- unique(all))
## [1] 4 2 1
match(all,uniq)
## [1] 1 1 2 3 2

The idea is to get all the unique rows from a data.table but, at the same time, to identify for each row of the original data.table to which row of the unique DT it matches. Maybe it is more clear with the example below.

Edit: My desire output is a data table with the unique rows AND a vector with the row positions of the original table in the unique-rows' table (i.e. a vector of length nrow(dt))

This is the best solution that I've found so far:

library(data.table) # data.table 1.12.0
# small data.table example   ## edited to be reproducible 
m <- matrix(c(5, 5, 5, 5, 6, 5, 6, 6, 7, 7, 5, 5, 7, 5, 6, 6), nrow = 8, ncol = 2)
dt <- setDT(as.data.frame(m)) 
dt.uniq <- unique(dt)

#this returns the ID or position of the rows in the first table on the second table
match(transpose(dt), transpose(dt.uniq))
## [1] 1 1 2 2 3 2 4 4


Solution

  • Here is a possible approach with data.table version 1.12.4:

    DT[, urow := unique(DT)[DT, on=.NATURAL, which=TRUE]]
    

    output:

       V1 V2 urow
    1:  7  5    1
    2:  5  6    2
    3:  6  5    3
    4:  7  6    4
    5:  7  7    5
    6:  7  6    4
    7:  7  5    1
    8:  6  6    6
    

    data:

    library(data.table) #data.table_1.12.4
    set.seed(9L)
    DT <- as.data.table(matrix(sample(5:7, 16, replace = T), nrow = 8, ncol = 2))
    DT
    

    Edit to address comments:

    1) on=.NATURAL

    From ?data.table under the Arguments section in on:

    When .NATURAL keyword provided then natural join is made (join on common columns).

    That is, in X[Y, on=.NATURAL], keys used for joining are columns that are both found in X and Y (intersect(names(X), names(Y))).

    2) which=TRUE

    From ?data.table under the Arguments section in which:

    TRUE returns the row numbers of x that i matches to. If NA, returns the row numbers of i that have no match in x. By default FALSE and the rows in x that match are returned.

    That is, the row indices of x (unique(DT) in OP) are returned from the join with i (DT here).