I have a dataset with 500 000 entries. Each entry in it has a userId and a productId. I want to get all userIds corresponding to each distinct productIds. But the list is to huge that none of the following method works for me, it's going very slow. Is there any faster solution.
Using lapply
: (Problem: Traversing the whole rpid list for each uniqPids elements)
orderedIndx <- lapply(uniqPids, function(x){
which(rpid %in% x)
})
names(orderedIndx) <- uniqPids
#Looking for indices with each unique productIds
Using For
loop:
orderedIndx <- list()
for(j in 1:length(rpid)){
existing <- length(orderedIndx[rpid[j]])
orderedIndx[rpid[j]][existing + 1] <- j
}
Sample Data:
ruid[1:10]
# [1] "a3sgxh7auhu8gw" "a1d87f6zcve5nk" "abxlmwjixxain" "a395borc6fgvxv" "a1uqrsclf8gw1t" "adt0srk1mgoeu"
[7] "a1sp2kvkfxxru1" "a3jrgqveqn31iq" "a1mzyo9tzk0bbi" "a21bt40vzccyt4"
rpid[1:10]
# [1] "b001e4kfg0" "b001e4kfg0" "b000lqoch0" "b000ua0qiq" "b006k2zz7k" "b006k2zz7k" "b006k2zz7k" "b006k2zz7k"
[9] "b000e7l2r4" "b00171apva"
Output should be like:
b001e4kfg0 -> a3sgxh7auhu8gw, a1d87f6zcve5nk
b000lqoch0 -> abxlmwjixxain
b000ua0qiq -> a395borc6fgvxv
b006k2zz7k -> a1uqrsclf8gw1t, adt0srk1mgoeu, a1sp2kvkfxxru1, a3jrgqveqn31iq
b000e7l2r4 -> a1mzyo9tzk0bbi
b00171apva -> a21bt40vzccyt4
Not exactly sure what type of output you want, or how many rows you have in your dataset, but I'd suggest 3 versions and you can chose the one you like. First version uses dplyr
and character values for your variables. I expect this to be slow if you have millions of rows. Second version uses dplyr
but factor variables. I expect this to be faster than the previous one. Third version uses data.table
. I expect this to be equally fast, or faster than the second version.
library(dplyr)
ruid =
c("a3sgxh7auhu8gw", "a1d87f6zcve5nk", "abxlmwjixxain", "a395borc6fgvxv",
"a1uqrsclf8gw1t", "adt0srk1mgoeu", "a1sp2kvkfxxru1", "a3jrgqveqn31iq",
"a1mzyo9tzk0bbi", "a21bt40vzccyt4")
rpid =
c("b001e4kfg0", "b001e4kfg0", "b000lqoch0", "b000ua0qiq", "b006k2zz7k",
"b006k2zz7k", "b006k2zz7k", "b006k2zz7k", "b000e7l2r4", "b00171apva")
### using dplyr and character values
dt = data.frame(rpid, ruid, stringsAsFactors = F)
dt %>%
group_by(rpid) %>%
do(data.frame(list_ruids = paste(c(.$ruid), collapse=", "))) %>%
ungroup
# rpid list_ruids
# (chr) (chr)
# 1 b000e7l2r4 a1mzyo9tzk0bbi
# 2 b000lqoch0 abxlmwjixxain
# 3 b000ua0qiq a395borc6fgvxv
# 4 b00171apva a21bt40vzccyt4
# 5 b001e4kfg0 a3sgxh7auhu8gw, a1d87f6zcve5nk
# 6 b006k2zz7k a1uqrsclf8gw1t, adt0srk1mgoeu, a1sp2kvkfxxru1, a3jrgqveqn31iq
# ----------------------------------
### using dplyr and factor values
dt = data.frame(rpid, ruid, stringsAsFactors = T)
dt %>%
group_by(rpid) %>%
do(data.frame(list_ruids = paste(c(levels(dt$ruid)[.$ruid]), collapse=", "))) %>%
ungroup
# rpid list_ruids
# (fctr) (chr)
# 1 b000e7l2r4 a1mzyo9tzk0bbi
# 2 b000lqoch0 abxlmwjixxain
# 3 b000ua0qiq a395borc6fgvxv
# 4 b00171apva a21bt40vzccyt4
# 5 b001e4kfg0 a3sgxh7auhu8gw, a1d87f6zcve5nk
# 6 b006k2zz7k a1uqrsclf8gw1t, adt0srk1mgoeu, a1sp2kvkfxxru1, a3jrgqveqn31iq
# -------------------------------------
library(data.table)
### using data.table
dt = data.table(rpid, ruid)
dt[, list(list_ruids = paste(c(ruid), collapse=", ")), by = rpid]
# rpid list_ruids
# 1: b001e4kfg0 a3sgxh7auhu8gw, a1d87f6zcve5nk
# 2: b000lqoch0 abxlmwjixxain
# 3: b000ua0qiq a395borc6fgvxv
# 4: b006k2zz7k a1uqrsclf8gw1t, adt0srk1mgoeu, a1sp2kvkfxxru1, a3jrgqveqn31iq
# 5: b000e7l2r4 a1mzyo9tzk0bbi
# 6: b00171apva a21bt40vzccyt4