Tags: r, data.table, match, large-data

Multiple matches in different rows and columns for large data


I have a problem with multiple matches between two tables (1 million rows × 15 columns; 3,000 rows × 20 columns), which may potentially grow much larger (10 million rows).

My solution works, but I would like to make the script as fast as possible, considering that I may have to use it with larger data frames. I am using the R package data.table.

Consider two example tables where no row can be deleted:

table1 - a ToMatch value of FALSE means that the associated tag is not present in table2; this step reduces the number of matches to be performed by two orders of magnitude:

library(data.table)
set.seed(99)
table1 <- data.table(Tag = sample(paste0("tag_", 1:3), 5, replace = TRUE))
table1[ , ToMatch := ifelse(Tag == "tag_1", FALSE, TRUE)]

table1
     Tag ToMatch
1: tag_2    TRUE
2: tag_1   FALSE
3: tag_3    TRUE
4: tag_3    TRUE
5: tag_2    TRUE

table2:

set.seed(99)
table2 <- data.table(center = sample(paste0("tag_", 2:8), 5, replace = TRUE),
                     north  = sample(paste0("tag_", 2:8), 5, replace = TRUE),
                     south  = sample(paste0("tag_", 2:8), 5, replace = TRUE))

> table2
   center north south
1:  tag_6 tag_8 tag_5
2:  tag_2 tag_6 tag_5
3:  tag_6 tag_4 tag_3
4:  tag_8 tag_4 tag_6
5:  tag_5 tag_3 tag_6

My objective is to find the rows of table2 in which the tags of table1 appear (they can be in any of the columns above). I was thinking of the output as a list:

Output:

     Tag ToMatch output
1: tag_2    TRUE      2
2: tag_1   FALSE     NA
3: tag_3    TRUE    3,5
4: tag_3    TRUE    3,5
5: tag_2    TRUE      2

My solution:

Which rows of table1 need to be evaluated:

match.index <- which(table1$ToMatch)
> match.index
[1] 1 3 4 5

Pool all tags from table2 into a single vector. Transposing with t() preserves the row order (tag_6 tag_8 tag_5, tag_2 tag_6 tag_5, ...):

all.tags <- as.vector(t(table2))
> all.tags
 [1] "tag_6" "tag_8" "tag_5" "tag_2" "tag_6" "tag_5" "tag_6"
 [8] "tag_4" "tag_3" "tag_8" "tag_4" "tag_6" "tag_5" "tag_3"
[15] "tag_6"

Predefine a placeholder list filled with NA:

list.results <- as.list(rep(NA_real_, nrow(table1)))

The loop:

for (i in seq_along(match.index)) {
    list.results[[ match.index[i] ]] <- ceiling(
        grep(table1[match.index[i], Tag], all.tags) / 3
    )
}

# Dividing the indices of all.tags found with grep by 3 (the original
# number of columns in table2) and rounding up to the nearest integer
# (ceiling) gives the row index of table2 where the tag is located.
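As a small standalone check of that index arithmetic, using the all.tags values printed above (rebuilt literally here so the snippet runs on its own):

```r
# all.tags as printed above, one group of three per row of table2
all.tags <- c("tag_6", "tag_8", "tag_5",   # row 1
              "tag_2", "tag_6", "tag_5",   # row 2
              "tag_6", "tag_4", "tag_3",   # row 3
              "tag_8", "tag_4", "tag_6",   # row 4
              "tag_5", "tag_3", "tag_6")   # row 5

pos <- grep("tag_3", all.tags)  # flat positions 9 and 14
ceiling(pos / 3)                # rows 3 and 5 of table2
```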

Final output:

> table1[ , output := list.results]
> table1
     Tag ToMatch output
1: tag_2    TRUE      2
2: tag_1   FALSE     NA
3: tag_3    TRUE    3,5
4: tag_3    TRUE    3,5
5: tag_2    TRUE      2

Do you have any suggestion to speed up this code?

Thank you in advance


Solution

  • The difficulty is mainly in the wide representation of table2. Once that's been melted down, the rest is easy:

    melt(table2[, id := .I], id = 'id')[
      table1, on = c(value = 'Tag'), .(list(if(ToMatch) id)), by = .EACHI]
    #   value   V1
    #1: tag_2    2
    #2: tag_1 NULL
    #3: tag_3  5,3
    #4: tag_3  5,3
    #5: tag_2    2
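
    To see why melting makes this easy, here is the long-form intermediate on its own. This sketch rebuilds table2 with the literal values shown above, so it runs standalone:

    ```r
    library(data.table)

    # table2 rebuilt from the values printed in the question
    table2 <- data.table(center = c("tag_6", "tag_2", "tag_6", "tag_8", "tag_5"),
                         north  = c("tag_8", "tag_6", "tag_4", "tag_4", "tag_3"),
                         south  = c("tag_5", "tag_5", "tag_3", "tag_6", "tag_6"))

    # Tag each row with its index, then reshape wide -> long: one
    # (id, variable, value) triple per cell of table2.
    long <- melt(table2[, id := .I], id = 'id')
    nrow(long)  # 15 rows = 5 rows * 3 tag columns
    ```

    With one tag per row, a single join on the value column replaces grepping a flattened vector, and the row id comes along for free.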
    

    And if you have a lot of duplicates, run unique() on your data beforehand:

    melt(table2[, id := .I], id = 'id')[
      unique(table1), on = c(value = 'Tag'), .(list(if(ToMatch) id)), by = .EACHI][
      table1, on = c(value = 'Tag')]
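
    As a sanity check (not part of the original answer), the deduplicated variant can be compared against the direct join on regenerated example data. Exact sampled values may differ across R versions, which does not affect the comparison:

    ```r
    library(data.table)

    # Regenerate the example tables from the question
    set.seed(99)
    table1 <- data.table(Tag = sample(paste0("tag_", 1:3), 5, replace = TRUE))
    table1[, ToMatch := Tag != "tag_1"]
    set.seed(99)
    table2 <- data.table(center = sample(paste0("tag_", 2:8), 5, replace = TRUE),
                         north  = sample(paste0("tag_", 2:8), 5, replace = TRUE),
                         south  = sample(paste0("tag_", 2:8), 5, replace = TRUE))

    long <- melt(table2[, id := .I], id = 'id')

    # Direct join: one group per row of table1
    direct <- long[table1, on = c(value = 'Tag'),
                   .(list(if (ToMatch) id)), by = .EACHI]

    # Deduplicated join, then expanded back to table1's rows
    dedup <- long[unique(table1), on = c(value = 'Tag'),
                  .(list(if (ToMatch) id)), by = .EACHI][
                  table1, on = c(value = 'Tag')]

    identical(direct$V1, dedup$V1)
    ```

    Both produce one list element per row of table1, so deduplication only changes how much work the join itself does, not the result.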