I have a problem with multiple matches between two tables (1 million rows × 15 columns; 3000 rows × 20 columns), which may grow much larger (10 million rows).
My solution works, but I would like to make the script as fast as possible, since I may have to use it on larger data frames. I am using the R package data.table.
Consider two example tables where no row can be deleted:
table 1 - A ToMatch value of FALSE means that the associated tag is not present in table2; this step reduced the number of matches to be performed by two orders of magnitude:
library(data.table)
set.seed(99)
table1 <- data.table(Tag = sample(paste0("tag_", 1:3), 5, replace = TRUE))
table1[ , ToMatch := Tag != "tag_1"]
table1
Tag ToMatch
1: tag_2 TRUE
2: tag_1 FALSE
3: tag_3 TRUE
4: tag_3 TRUE
5: tag_2 TRUE
table2:
set.seed(99)
table2 <- data.table(center = sample(paste0("tag_", 2:8), 5, replace = TRUE),
                     north  = sample(paste0("tag_", 2:8), 5, replace = TRUE),
                     south  = sample(paste0("tag_", 2:8), 5, replace = TRUE))
> table2
center north south
1: tag_6 tag_8 tag_5
2: tag_2 tag_6 tag_5
3: tag_6 tag_4 tag_3
4: tag_8 tag_4 tag_6
5: tag_5 tag_3 tag_6
My objective is to find the rows of table2 where the tags of table1 are found (a tag can appear in any of the columns above). I was thinking of the output as a list:
Output:
Tag ToMatch output
1: tag_2 TRUE 2
2: tag_1 FALSE NA
3: tag_3 TRUE 3,5
4: tag_3 TRUE 3,5
5: tag_2 TRUE 2
My solution:
Which rows of table 1 need to be evaluated:
match.index <- which(table1$ToMatch)
> match.index
[1] 1 3 4 5
Pool all tags from table2 into a single vector, maintaining the row order by using t() (tag_6, tag_8, tag_5, tag_2, tag_6, tag_5, ...):
all.tags <- as.vector(t(table2))
> all.tags
[1] "tag_6" "tag_8" "tag_5" "tag_2" "tag_6" "tag_5" "tag_6"
[8] "tag_4" "tag_3" "tag_8" "tag_4" "tag_6" "tag_5" "tag_3"
[15] "tag_6"
Predefine an empty list:
list.results <- as.list(rep(NA_real_, nrow(table1)))
The loop:
for (i in seq_along(match.index)) {
  # fixed = TRUE matches the tag literally, so e.g. "tag_1" cannot
  # accidentally match "tag_10" once the tag set grows
  list.results[[ match.index[i] ]] <- ceiling(
    grep(table1[match.index[i], Tag], all.tags, fixed = TRUE)
    / 3)
}
# dividing the index of all.tags found with grep by 3 (the original
# number of columns in table2) and rounding up to the closest integer
# (ceiling) return the index of the original table 2 where the tag
# is located
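As a quick sanity check of that index arithmetic (assuming the 3 columns of table2): positions 1-3 of all.tags map to row 1, positions 4-6 to row 2, and so on.

ceiling(c(1, 3, 4, 9) / 3)
# [1] 1 1 2 3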
Final output:
> table1[ , output := list.results]
> table1
Tag ToMatch output
1: tag_2 TRUE 2
2: tag_1 FALSE NA
3: tag_3 TRUE 3,5
4: tag_3 TRUE 3,5
5: tag_2 TRUE 2
Do you have any suggestions to speed up this code?
Thank you in advance.
The difficulty is mainly in the wide representation of table2. Once that's been melted down, the rest is easy:
melt(table2[, id := .I], id = 'id')[
table1, on = c(value = 'Tag'), .(list(if(ToMatch) id)), by = .EACHI]
# value V1
#1: tag_2 2
#2: tag_1 NULL
#3: tag_3 5,3
#4: tag_3 5,3
#5: tag_2 2
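To see what that join operates on, it may help to print the melted intermediate (assuming data.table is loaded and table2 is as in the question; id, variable and value are melt's default column names):

# One row per (table2 row id, column, tag); the columns are stacked
# in order, so the first 5 rows come from the center column
long <- melt(table2[, id := .I], id = 'id')
head(long, 5)
#    id variable value
# 1:  1   center tag_6
# 2:  2   center tag_2
# 3:  3   center tag_6
# 4:  4   center tag_8
# 5:  5   center tag_5

The join then only has to match table1$Tag against the single value column, instead of scanning three wide columns.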
And if you have a lot of duplicates, run unique on your data beforehand:
melt(table2[, id := .I], id = 'id')[
unique(table1), on = c(value = 'Tag'), .(list(if(ToMatch) id)), by = .EACHI][
table1, on = c(value = 'Tag')]
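If you want the matches stored back on table1 as an output column, as in the question, one option is to assign the list column afterwards (a sketch; V1 is the default name data.table gives the unnamed list column, and note that non-matching rows hold NULL rather than NA):

# Join once, then attach the list of matching table2 row ids to table1
res <- melt(table2[, id := .I], id = 'id')[
  table1, on = c(value = 'Tag'), .(list(if (ToMatch) id)), by = .EACHI]
table1[, output := res$V1]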