I have program that needs to repeat a merge which uses the same column from Table A over and over, but changes the column from table B. Doing it with a loop that uses the data.table::merge command repeatedly is quite slow, so I'm wondering if there's a faster way to do it.
For an example:
Imagine a table "A" of fruits, with two columns, "fruit_name" and "price."
And another table "B" of baskets, with 3 columns, "fruit_1", "fruit_2", and "fruit_3"
I would like to get the sum price of the fruits in table B for each row. I could do 3 merges, all using "fruit_name" from the fruit table, and the first using "fruit_1", second "fruit_2", and 3rd "fruit_3" from the baskets table.
Merges take a long time, though. Is there anyway to do this more efficiently computationally? Here is the code laying out the example and getting the desired result, but doing it too slowly.
I generally use data.table and it's generally efficient, so I prefer a data.table solution but open to others if they are faster than 3 merges.
Also, I imagine I could get the data in long format and do one merge, ideally I'd avoid that,because my data makes a lot of sense in wide format, comes in wide, and needs to be exported wide. But if it's definitely the best practice then I guess that's good to know.
Thanks all for your time!
Edit: I chose the answer that I did because for my use case, at least, its limitations are not a problem. And, it is very, very fast.
The more upvoted answer takes a bit longer to do the initial melt step than the chosen one takes to do the whole task. So, for speed alone, I think the chosen answer is best, if it works for you.
If it doesn't, then the most upvoted answer is probably good!
library(data.table)
fruits <- data.table(fruit_name = c('orange', 'apple', 'pear', 'kiwi', 'blueberry')
, price = c(1, 1.531, 2.1, 2.25, 3.03)
)
baskets <- data.table(fruit_1 = c('orange', 'apple', 'apple', 'pear')
,fruit_2 = c('apple', 'pear', 'kiwi', 'kiwi')
,fruit_3 = c('pear', 'kiwi', 'blueberry', 'blueberry'))
result <- copy (baskets)
result <- merge(result, fruits, by.x = 'fruit_1', by.y = 'fruit_name')
setnames(result, 'price', 'price_1')
result <- merge(result, fruits, by.x = 'fruit_2', by.y = 'fruit_name')
setnames(result, 'price', 'price_2')
result <- merge(result, fruits, by.x = 'fruit_3', by.y = 'fruit_name')
setnames(result, 'price', 'price_3')
result[,price_total := price_1 + price_2 + price_3]
Using the function chmatch
from data.table, you can solve your problem as follow. If it is still slow, replace chmatch
with the function collapse::fmatch
and check if there is some performance improvement.
baskets[, price_total := rowSums(sapply(.SD, \(x) fruits$price[chmatch(x, fruits$fruit_name)])), .SDcols=patterns("fruit")]
# fruit_1 fruit_2 fruit_3 price_total
# <char> <char> <char> <num>
# 1: orange apple pear 4.631
# 2: apple pear kiwi 5.881
# 3: apple kiwi blueberry 6.811
# 4: pear kiwi blueberry 7.380