Tags: r, data.table, match, large-data

Multiple matches in different rows and columns for large data


I have a problem with multiple matches between two tables (1 million rows × 15 columns; 3,000 rows × 20 columns), which may potentially grow much larger (10 million rows).

My solution works, but I would like to make the script as fast as possible, considering that I may have to use it with larger data frames. I am using the R package data.table.

Consider two example tables where no row can be deleted:

table1 - a ToMatch value of FALSE means that the associated tag is not present in table2; this step reduces the number of matches to be performed by two orders of magnitude:

library(data.table)
set.seed(99)
table1 <- data.table(Tag = sample(paste0("tag_", 1:3), 5, replace = TRUE))
table1[ , ToMatch := ifelse(Tag == "tag_1", FALSE, TRUE)]

table1
     Tag ToMatch
1: tag_2    TRUE
2: tag_1   FALSE
3: tag_3    TRUE
4: tag_3    TRUE
5: tag_2    TRUE

table2:

set.seed(99)
table2 <- data.table(center = sample(paste0("tag_", 2:8), 5, replace = TRUE),
                     north  = sample(paste0("tag_", 2:8), 5, replace = TRUE),
                     south  = sample(paste0("tag_", 2:8), 5, replace = TRUE))

> table2
   center north south
1:  tag_6 tag_8 tag_5
2:  tag_2 tag_6 tag_5
3:  tag_6 tag_4 tag_3
4:  tag_8 tag_4 tag_6
5:  tag_5 tag_3 tag_6

My objective is to find the rows of table2 in which the tags of table1 appear (they can be in any of the columns above). I was thinking of the output as a list:

Output:

     Tag ToMatch output
1: tag_2    TRUE      2
2: tag_1   FALSE     NA
3: tag_3    TRUE    3,5
4: tag_3    TRUE    3,5
5: tag_2    TRUE      2

My solution:

Which rows of table1 need to be evaluated:

match.index <- which(table1$ToMatch)
> match.index
[1] 1 3 4 5

Pool all tags from table2 into a single vector. Transposing with t() preserves the row order (tag_6 tag_8 tag_5, tag_2 tag_6 tag_5, ...):

all.tags <- as.vector(t(table2))
> all.tags
 [1] "tag_6" "tag_8" "tag_5" "tag_2" "tag_6" "tag_5" "tag_6"
 [8] "tag_4" "tag_3" "tag_8" "tag_4" "tag_6" "tag_5" "tag_3"
[15] "tag_6"

Predefine a placeholder list filled with NA:

list.results <- as.list(rep(NA_real_, nrow(table1)))

The loop:

for (i in seq_along(match.index)) {
    list.results[[ match.index[i] ]] <- ceiling(
        grep(table1[match.index[i], Tag], all.tags) / 3
    )
}

# Dividing the indices of all.tags found with grep by 3 (the original
# number of columns in table2) and rounding up to the nearest integer
# (ceiling) gives the row index of table2 where the tag is located.
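As a small standalone check of that index arithmetic, using the all.tags values printed above (rebuilt literally here so the snippet runs on its own):

```r
# all.tags as printed above, one group of three per row of table2
all.tags <- c("tag_6", "tag_8", "tag_5",   # row 1
              "tag_2", "tag_6", "tag_5",   # row 2
              "tag_6", "tag_4", "tag_3",   # row 3
              "tag_8", "tag_4", "tag_6",   # row 4
              "tag_5", "tag_3", "tag_6")   # row 5

pos <- grep("tag_3", all.tags)  # flat positions 9 and 14
ceiling(pos / 3)                # rows 3 and 5 of table2
```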

Final output:

> table1[ , output := list.results]
> table1
     Tag ToMatch output
1: tag_2    TRUE      2
2: tag_1   FALSE     NA
3: tag_3    TRUE    3,5
4: tag_3    TRUE    3,5
5: tag_2    TRUE      2

Do you have any suggestion to speed up this code?

Thank you in advance


Solution

  • The difficulty is mainly in the wide representation of table2. Once that's been melted down, the rest is easy:

    melt(table2[, id := .I], id = 'id')[
      table1, on = c(value = 'Tag'), .(list(if(ToMatch) id)), by = .EACHI]
    #   value   V1
    #1: tag_2    2
    #2: tag_1 NULL
    #3: tag_3  5,3
    #4: tag_3  5,3
    #5: tag_2    2
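
    To see why melting makes this easy, here is the long-form intermediate on its own. This sketch rebuilds table2 with the literal values shown above, so it runs standalone:

    ```r
    library(data.table)

    # table2 rebuilt from the values printed in the question
    table2 <- data.table(center = c("tag_6", "tag_2", "tag_6", "tag_8", "tag_5"),
                         north  = c("tag_8", "tag_6", "tag_4", "tag_4", "tag_3"),
                         south  = c("tag_5", "tag_5", "tag_3", "tag_6", "tag_6"))

    # Tag each row with its index, then reshape wide -> long: one
    # (id, variable, value) triple per cell of table2.
    long <- melt(table2[, id := .I], id = 'id')
    nrow(long)  # 15 rows = 5 rows * 3 tag columns
    ```

    With one tag per row, a single join on the value column replaces grepping a flattened vector, and the row id comes along for free.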
    

    And if you have a lot of duplicates, run unique() on your data beforehand:

    melt(table2[, id := .I], id = 'id')[
      unique(table1), on = c(value = 'Tag'), .(list(if(ToMatch) id)), by = .EACHI][
      table1, on = c(value = 'Tag')]
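
    As a sanity check (not part of the original answer), the deduplicated variant can be compared against the direct join on regenerated example data. Exact sampled values may differ across R versions, which does not affect the comparison:

    ```r
    library(data.table)

    # Regenerate the example tables from the question
    set.seed(99)
    table1 <- data.table(Tag = sample(paste0("tag_", 1:3), 5, replace = TRUE))
    table1[, ToMatch := Tag != "tag_1"]
    set.seed(99)
    table2 <- data.table(center = sample(paste0("tag_", 2:8), 5, replace = TRUE),
                         north  = sample(paste0("tag_", 2:8), 5, replace = TRUE),
                         south  = sample(paste0("tag_", 2:8), 5, replace = TRUE))

    long <- melt(table2[, id := .I], id = 'id')

    # Direct join: one group per row of table1
    direct <- long[table1, on = c(value = 'Tag'),
                   .(list(if (ToMatch) id)), by = .EACHI]

    # Deduplicated join, then expanded back to table1's rows
    dedup <- long[unique(table1), on = c(value = 'Tag'),
                  .(list(if (ToMatch) id)), by = .EACHI][
                  table1, on = c(value = 'Tag')]

    identical(direct$V1, dedup$V1)
    ```

    Both produce one list element per row of table1, so deduplication only changes how much work the join itself does, not the result.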