Tags: r, adjacency-matrix, data-processing

Is there an R package to assist with large data processing?


I am processing a large dataset (after it has been cleaned). The dataset is then processed to create an adjacency matrix, which is passed through a logical evaluation (logEval) to identify the observations that contain each unique ID.

When running the code snippet to create the adjacency matrix, the process takes a huge amount of time (and sometimes it just freezes).

Obviously, this is because the function checks each of the unique elements (n=10901) and marks TRUE/FALSE depending on whether it appears in the observation. An example (greatly reduced):

  |Obs_1 |Obs_2 |Obs_3 |Obs_4 |Obs_5 | logEval|
  |:-----|:-----|:-----|:-----|:-----|-------:|
  |TRUE  |FALSE |FALSE |FALSE |FALSE |       1|
  |FALSE |TRUE  |FALSE |FALSE |FALSE |       1|
  |FALSE |FALSE |TRUE  |FALSE |FALSE |       1|
  |FALSE |FALSE |FALSE |TRUE  |FALSE |       1|
  |FALSE |FALSE |FALSE |FALSE |TRUE  |       1|
  |FALSE |FALSE |FALSE |FALSE |TRUE  |       1|
  |FALSE |FALSE |FALSE |FALSE |FALSE |       0|
  |FALSE |FALSE |FALSE |FALSE |FALSE |       0|
  |FALSE |FALSE |TRUE  |FALSE |FALSE |       1|
  |TRUE  |FALSE |FALSE |FALSE |FALSE |       1|
  |FALSE |FALSE |FALSE |FALSE |TRUE  |       1|
  |FALSE |FALSE |FALSE |FALSE |FALSE |       0|
  |FALSE |FALSE |FALSE |FALSE |FALSE |       0|
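For concreteness, here is a minimal toy sketch (the ids and sizes are made up for illustration, not taken from the real data) of how a membership table like the one above can be produced:

library(data.table)

toy <- data.table(col1 = c(11L, 12L, 13L), col2 = c(12L, 14L, 11L))
ids <- unique(unlist(toy))

#one logical column per observation: does each id appear in observation i?
mat <- sapply(seq_len(nrow(toy)), function(i) ids %in% unlist(toy[i]))
colnames(mat) <- paste0("Obs_", seq_len(nrow(toy)))

#logEval counts how many observations each id appears in
data.table(mat, logEval = rowSums(mat))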

In actuality, there are 43 observations (Obs) and more than 100,000 comparisons.

Problem: R crashes. Is there a better way to run this without having it crash due to size?

Code snippet:

df1<-data.table(col1=sample(500000:500900,700,replace = T),
                col2=sample(500000:500900,700,replace = T),
                col3=sample(500000:500900,700,replace = T),
                col4=sample(500000:500900,700,replace = T),
                col5 = sample(500000:500900,700,replace = T),
                col6 = sample(500000:500900,700,replace = T),
                col7 = sample(500000:500900,700,replace = T),
                col8 = sample(500000:500900,700,replace = T),
                col9 = sample(500000:500900,700,replace = T),
                col10 = sample(500000:500900,700,replace = T),
                col11 = sample(500000:500900,700,replace = T),
                col12 = sample(500000:500900,700,replace = T),
                col13 = sample(500000:500900,700,replace = T),
                col14 = sample(500000:500900,700,replace = T),
                col15 = sample(500000:500900,700,replace = T),
                col16 = sample(500000:500900,700,replace = T),
                col17 = sample(500000:500900,700,replace = T),
                col18 = sample(500000:500900,700,replace = T),
                col19 = sample(500000:500900,700,replace = T),
                col20 = sample(500000:500900,700,replace = T),
                col21 = sample(500000:500900,700,replace = T),
                col22 = sample(500000:500900,700,replace = T),
                col23 = sample(500000:500900,700,replace = T),
                col24 = sample(500000:500900,700,replace = T),
                col25 = sample(500000:500900,700,replace = T),
                col26 = sample(500000:500900,700,replace = T),
                col27 = sample(500000:500900,700,replace = T),
                col28 = sample(500000:500900,700,replace = T),
                col29 = sample(500000:500900,700,replace = T),
                col30 = sample(500000:500900,700,replace = T),
                col31 = sample(500000:500900,700,replace = T),
                col32 = sample(500000:500900,700,replace = T),
                col33 = sample(500000:500900,700,replace = T),
                col34 = sample(500000:500900,700,replace = T),
                col35 = sample(500000:500900,700,replace = T),
                col36 = sample(500000:500900,700,replace = T),
                col37 = sample(500000:500900,700,replace = T),
                col38 = sample(500000:500900,700,replace = T),
                col39 = sample(500000:500900,700,replace = T),
                col40 = sample(500000:500900,700,replace = T),
                col41 = sample(500000:500900,700,replace = T),
                col42 = sample(500000:500900,700,replace = T),
                col43 = sample(500000:500900,700,replace = T))


#find all ids via table
uniqueIDs<-as.character(unique(unlist(df1)))

#creating adjacency matrix
mat <- sapply(uniqueIDs, function(s) apply(df1, 1, function(x) s %in% x))

#clean-up 
colnames(mat) <- uniqueIDs

rownames(mat) <- paste0("row", seq_len(nrow(df1)))

#transpose so rows are the unique ids and columns are the observations
mat <- data.table(t(mat))

#apply a logical evaluation to count the number of TRUE values per id
mat$logEval <- rowSums(mat == TRUE)
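As an aside (not part of the original snippet): if the membership matrix itself is all you need, it can be built with one hashed %in% call per row instead of one call per (id, row) pair, which avoids the nested sapply/apply blow-up. This is a sketch assuming the df1 and uniqueIDs objects defined above:

m <- as.matrix(df1)
mat <- t(vapply(seq_len(nrow(m)),
                function(i) uniqueIDs %in% as.character(m[i, ]),
                logical(length(uniqueIDs))))
dimnames(mat) <- list(paste0("row", seq_len(nrow(m))), uniqueIDs)

#same logEval as before: number of observations each id appears in
logEval <- colSums(mat)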

I want to make a small update to ensure my overall goal is clear:

  • the dataset has x (43) obs, and each obs has y (200) nbrids.

  • the goal of running the above code is to create an adjacency matrix to identify the nbrids (y) that appear per column. [For example, from the unique nbrids: does y(1) appear in x(i); does y(2); ... does y(900)?]

  • I am not concerned with x, per se. The end goal is:

From the unique ids throughout the matrix, which unique ids appear together, and how often? [This is why I create the logic test to count .n(i)==TRUE.] For those >2, I can filter, as it is likely that such rows share nbrids.

Sample end matrix:

    From       To         Weight
    50012      50056      5
    50012      50032      3
    …
    50063      50090      9

Man, that's a mouthful!
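To restate that goal as code, here is a hedged sketch assuming the df1 defined above (combn() is used as a brute-force restatement of the goal, not as the efficient way to compute it): for every column, list all pairs of distinct ids it contains, then count how often each pair recurs across columns.

library(data.table)

pairs <- rbindlist(lapply(df1, function(x) {
  u <- sort(unique(x))
  if (length(u) < 2) return(NULL)
  as.data.table(t(combn(u, 2)))  #every id pair within this column
}))
setnames(pairs, c("From", "To"))
pairs[, .(Weight = .N), keyby = .(From, To)]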


Solution

  • 2nd Edit:

    These options should get you to the expected output in your edit. Both rely on self-joins to find which combinations occur: the first uses lapply() to self-join one column at a time, while the second melt()s the entire dataset and then self-joins it. For smaller datasets, lapply() is slower, but when I tried 7,000 rows it still came through, whereas the melt-and-self-join created too large a data frame.

    One additional note: this dataset doesn't really have many unique values. If I knew it was sparse, I'd likely add a line to filter out values that are not duplicated anywhere in the dataset; a sketch of that pre-filter follows the first option's output below.

    library(data.table)
    
    # generate data -----------------------------------------------------------
    set.seed(1234)
    dt1<- data.table(replicate(43, sample(500000:500900,700, replace = TRUE)))
    
    rbindlist(
      lapply(dt1
           , function(x) {
             nbrid_dt = data.table(nbrid = unique(x))
    
             nbrid_dt[nbrid_dt
                      , on = .(nbrid < nbrid)
                      , j = .(From = x.nbrid, To = i.nbrid)
                      , nomatch = 0L
                      , allow.cartesian = TRUE]
           }
           )
      )[, .N, keyby = .(From, To)]
    
              From     To  N
         1: 500000 500001 11
         2: 500000 500002 11
         3: 500000 500003  7
         4: 500000 500004  9
         5: 500000 500005 13
        ---                 
    405446: 500897 500899 12
    405447: 500897 500900 10
    405448: 500898 500899 13
    405449: 500898 500900 12
    405450: 500899 500900  9
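    A minimal sketch of that pre-filter (an assumption on my part, not code from the original answer): a value that appears only once in the entire dataset can only ever produce pairs with N = 1, so if you only care about pairs that co-occur in more than one column, such values can be dropped before the self-joins.

    all_vals <- unlist(dt1)
    keep <- unique(all_vals[duplicated(all_vals)])  #values occurring at least twice

    rbindlist(
      lapply(dt1, function(x) {
        #restrict each column's id list to values that recur somewhere
        nbrid_dt <- data.table(nbrid = intersect(unique(x), keep))

        nbrid_dt[nbrid_dt
                 , on = .(nbrid < nbrid)
                 , j = .(From = x.nbrid, To = i.nbrid)
                 , nomatch = 0L
                 , allow.cartesian = TRUE]
      })
    )[, .N, keyby = .(From, To)][N > 1]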
    
    #all at once
    
    molten_dt <- unique(melt(dt1))
    setkey(molten_dt, variable)
    
    molten_dt[molten_dt
              , on = .(value < value
                       ,variable = variable
                        )
              , .(From = x.value, To = i.value)
              , allow.cartesian = TRUE
              , nomatch = 0L
              ][!is.na(From), .N, keyby = .(From, To)]
    

    Original: I'm not fully following, but if you are mainly after the counts across your 43 columns, it may be beneficial to gather / melt the data.

    molten_dt <- melt(dt1)
    
    molten_dt[, N := length(unique(variable)), by = value]
    
          variable  value  N
       1:       V1 500102  9
       2:       V1 500560  8
       3:       V1 500548  9
       4:       V1 500561 12
       5:       V1 500775  9
      ---                                    
    8596:      V43 500096  7
    8597:      V43 500320  6
    8598:      V43 500205 14
    8599:      V43 500711  7
    8600:      V43 500413 11
    
    #or you can aggregate instead of mutate-in-place
    
    molten_dt[, .(N = length(unique(variable))), by = value]
    
          value  N
      1: 500102  9
      2: 500560  8
      3: 500548  9
      4: 500561 12
      5: 500775  9
     ---          
    897: 500753  4
    898: 500759  4
    899: 500816  6
    900: 500772  4
    901: 500446  2
    

    Also, my answer doesn't 100% agree with @Konrad's. When there are duplicated values, there seems to be one additional count in @Konrad's solution.

    Data:

    set.seed(1234)
    dt1<- as.data.table(replicate(43, sample(500000 : 500900, 200, replace = TRUE)))
    #h/t to @Konrad for the quick way to make 43 columns
    

    1st Edit: If you are only interested in the count of each value, you can do the following:

    mat_data <- matrix(replicate(43, sample(500000 : 500900, 700, replace = TRUE)), ncol = 43)
    
    table(unlist(apply(mat_data, 2, unique)))
    

    It's the fastest approach, but the problem is that you lose the information about which column each value came from.

    Unit: milliseconds
               expr     min      lq     mean   median       uq      max neval
     melt_and_count 53.3914 53.8926 57.38576 55.95545 58.55605  79.2055    20
      table_version 11.0566 11.1814 12.24900 11.56760 12.82110  16.4351    20
     vapply_version 63.1623 64.8274 69.86041 67.84505 71.40635 108.2279    20
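    For reference, a hedged reconstruction of how this benchmark could be assembled with the microbenchmark package. The exact expressions timed are assumptions on my part; in particular, vapply_version (whose code never appears above) is a hypothetical stand-in.

    library(microbenchmark)
    library(data.table)

    set.seed(1234)
    mat_data <- matrix(replicate(43, sample(500000:500900, 700, replace = TRUE)), ncol = 43)
    dt1 <- as.data.table(mat_data)

    microbenchmark(
      melt_and_count = melt(dt1, measure.vars = names(dt1))[
        , .(N = length(unique(variable))), by = value],
      table_version  = table(unlist(apply(mat_data, 2, unique))),
      #hypothetical guess at the unseen vapply-based variant
      vapply_version = {
        ids <- unique(as.vector(mat_data))
        vapply(ids,
               function(s) sum(vapply(seq_len(ncol(mat_data)),
                                      function(j) s %in% mat_data[, j],
                                      logical(1))),
               integer(1))
      },
      times = 20
    )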