
Most efficient way for looping over 1.5k rows and updating a dataframe


I'm starting with a dataframe f of 1.5 million rows and trying to update an n×n dataframe Channels based on it. So far I have two nested for-loops, which make the script run very slowly.

I have a table with user ids and channels they like most (you can think of it as genres of songs they like most) like below:

id_user | id_channel
--------------------
   1    |     43
   1    |     61
   1    |    101
   2    |     43
   2    |    631
  ..    |     ..

What I expect to get is a dataframe Channels with columns and rows like below:

              id channel
     |  43    61    101   631
---------------------------------
43   |  NA     6     31     9
61   |   3    NA     11     1
101  |   2     1     NA    23
631  |  10     2      3    NA

where id_channel appears on both the rows and the columns. It means that 6 users who like channel 43 also like channel 61, 1 user who likes channel 101 also likes channel 61, and so on.

What I've done is in the code below. It works, but since my original table has 1.5 million rows, the code takes about 25 hours to finish. I assume there must be a much more efficient way to do this. f is the original data with 1.5 million rows (as a dataframe); lst_users (length 650k) and lst_channels (length 50) are vectors with the distinct id_users and id_channels.

Basically, I subset the original table by id_user and then, for each user, loop over their liked channels, updating the Channels dataframe.

for (user in lst_users) {
  # All rows (liked channels) for this user
  sub <- f[f$id_user == user, ]

  for (j in 1:nrow(sub)) {
    # Row index of the current channel; column indices of the user's
    # other channels (%in%, not ==, since the right-hand side is a vector)
    rindex <- which(lst_channels == sub$id_channel[j])
    cindex <- which(lst_channels %in% unique(sub$id_channel[-j]))

    Channels[rindex, cindex] <- Channels[rindex, cindex] + 1
  }
}
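For reference, the per-user pair counting above can also be expressed with split() and combn(), which avoids scanning f once per user with f[f$id_user == user, ]. A sketch on the question's sample data (not the accepted answer):

```r
# Sketch: count channel pairs per user via split() + combn().
f <- data.frame(id_user    = c(1, 1, 1, 2, 2),
                id_channel = c(43, 61, 101, 43, 631))
lst_channels <- sort(unique(f$id_channel))

n <- length(lst_channels)
Channels <- matrix(0, n, n, dimnames = list(lst_channels, lst_channels))

# One pass over f: a list of channel vectors, one per user
by_user <- split(as.character(f$id_channel), f$id_user)

for (chans in by_user) {
  chans <- unique(chans)
  if (length(chans) < 2) next
  pairs <- combn(chans, 2)                 # each unordered pair once
  for (k in seq_len(ncol(pairs))) {
    a <- pairs[1, k]; b <- pairs[2, k]
    Channels[a, b] <- Channels[a, b] + 1   # count both directions,
    Channels[b, a] <- Channels[b, a] + 1   # matching the original loop
  }
}
diag(Channels) <- NA
```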

The Channels object is initialized as below (it is actually a matrix, with NAs on its diagonal and 0s everywhere else):

Channels <- diag(NA, nrow = length(lst_channels), ncol = length(lst_channels))
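As a side note (a sketch, not code from the question): giving Channels dimnames up front would let the loop index by channel id directly, replacing the which() lookups:

```r
# Sketch: initialize Channels as a named matrix with NA on the diagonal
# and 0 elsewhere; lst_channels stands in for the real 50 distinct ids.
lst_channels <- c(43, 61, 101, 631)

n <- length(lst_channels)
Channels <- matrix(0, nrow = n, ncol = n,
                   dimnames = list(lst_channels, lst_channels))
diag(Channels) <- NA

# Character indexing now replaces the which() lookups, e.g.:
# Channels["43", "61"] <- Channels["43", "61"] + 1
```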

I'm trying to come up with a solution using functions like apply, mapply, sapply, etc., but couldn't build anything that actually works. Any ideas on how to approach this problem?

Edit:

As @alexis_laz pointed out, there is a similar problem in this question, whose solution can be applied here.

Solution: crossprod(table(f))
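To see why this one-liner works, here is a minimal sketch on the question's sample data (user/channel values taken from the table above):

```r
# Minimal reproduction of the crossprod(table(f)) idea.
f <- data.frame(id_user    = c(1, 1, 1, 2, 2),
                id_channel = c(43, 61, 101, 43, 631))

# table(f) builds a user x channel incidence table
# (1 if the user likes the channel, 0 otherwise).
incidence <- table(f)

# crossprod(X) computes t(X) %*% X: entry [i, j] counts the users
# who like both channel i and channel j.
Channels <- crossprod(incidence)

# The diagonal holds each channel's total user count; blank it out
# to match the desired layout.
diag(Channels) <- NA
Channels
```

Because every user-channel incidence is handled in one matrix product, the explicit loops over users and channels disappear entirely.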


Solution

  • Benchmarking the solution Channels <- crossprod(table(f)) gives the following results:

    Unit: seconds
                               expr      min       lq  median       uq      max neval
     crossprod(table(f[, c(1, 3)])) 6.059346 6.219557 6.31133 6.471866 7.358821    50
    

    Light-years faster than the two nested for-loops, which were taking 25+ hours.