I'm starting with a dataframe f
of 1.5kk rows and trying to update an nxn dataframe Channels
based on that f
dataframe. So far I've got 2 for-loops
that's leading to a very slow run of the script.
I have a table with user ids and channels they like most (you can think of it as genres of songs they like most) like below:
id_user | id_channel
--------------------
1 | 43
1 | 61
1 | 101
2 | 43
2 | 631
.. | ..
What I expect to get is a dataframe Channels
with columns and rows like below:
id channel
| 43 61 101 631
---------------------------------
43 | NA 6 31 9
61 | 3 NA 11 1
101 | 2 1 NA 23
631 | 10 2 3 NA
where I have id_channel
in both x and y-axis. It means that there are 6 users that likes channel 43 that also likes channel 61, 1 user that likes channel 101 that also likes channel 61 and so on.
What I've done is in the code below. It works, but since my original table has 1.5kk rows, it's taking about 25 hours to the code get to its end. I assume there might be a much more eficient way to do this. f
is the original data with 1.5kk rows (as dataframe), lst_users
(length 650k rows) and lst_channel
(length 50 rows) are vectors with distinct id_users
and id_channels
.
Basically, I subset the original table by id_user
and then for each one of them I loop over its liked channels updating the dataframe Channels
.
for (user in lst_users) {
sub <- f[f$id_user == user,]
for (j in 1:nrow(sub)) {
rindex <- which(lst_channels == sub$id_channel[j])
cindex <- which(lst_channels == unique(sub$id_channel[-j]))
Channels[rindex, cindex] <- Channels[rindex, cindex] + 1
}
}
The Channels
dataframe is initialized like below (NAs in its diagonal and 0s for the rest of the matrix):
Channels <- diag(NA, nrow= length(lst_channels), ncol= length(lst_channels))
I'm trying to come up with a solution using functions like apply, mapply, sapply, .., but couldn't come with something that actually works. Any ideas on how to approach this problem?
Edit:
As @alexis_laz pointed out there is a similar problem in this question whose solution can be implemented in here.
Solution: crossprod(table(f))
Benchmarking the solution Channels <- crossprod(table(f))
gives the following results:
Unit: seconds
expr min lq median uq max neval
crossprod(table(f[, c(1, 3)])) 6.059346 6.219557 6.31133 6.471866 7.358821 50
Light times faster than doing with 2 for-loops
that was taking 25 hours +.