Tags: r, adjacency-matrix, data-processing

Is there an R package to assist with large data processing?


I am processing a large dataset (after it has been cleaned). The dataset is then processed to create an adjacency matrix, which is passed through a logical evaluation (logEval) to identify the observations that contain each unique ID.

When running the code snippet to create the adjacency matrix, the process takes a huge amount of time (and sometimes it just freezes).

Obviously, this is because the function checks each of the unique elements (n=10901) and marks TRUE/FALSE depending on whether it appears in the observation. An example (greatly reduced):

  |Obs_1 |Obs_2 |Obs_3 |Obs_4 |Obs_5 | logEval|
  |:-----|:-----|:-----|:-----|:-----|-------:|
  |TRUE  |FALSE |FALSE |FALSE |FALSE |       1|
  |FALSE |TRUE  |FALSE |FALSE |FALSE |       1|
  |FALSE |FALSE |TRUE  |FALSE |FALSE |       1|
  |FALSE |FALSE |FALSE |TRUE  |FALSE |       1|
  |FALSE |FALSE |FALSE |FALSE |TRUE  |       1|
  |FALSE |FALSE |FALSE |FALSE |TRUE  |       1|
  |FALSE |FALSE |FALSE |FALSE |FALSE |       0|
  |FALSE |FALSE |FALSE |FALSE |FALSE |       0|
  |FALSE |FALSE |TRUE  |FALSE |FALSE |       1|
  |TRUE  |FALSE |FALSE |FALSE |FALSE |       1|
  |FALSE |FALSE |FALSE |FALSE |TRUE  |       1|
  |FALSE |FALSE |FALSE |FALSE |FALSE |       0|
  |FALSE |FALSE |FALSE |FALSE |FALSE |       0|
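For concreteness, here is a minimal toy sketch (the ids and sizes are made up for illustration, not taken from the real data) of how a membership table like the one above can be produced:

library(data.table)

toy <- data.table(col1 = c(11L, 12L, 13L), col2 = c(12L, 14L, 11L))
ids <- unique(unlist(toy))

#one logical column per observation: does each id appear in observation i?
mat <- sapply(seq_len(nrow(toy)), function(i) ids %in% unlist(toy[i]))
colnames(mat) <- paste0("Obs_", seq_len(nrow(toy)))

#logEval counts how many observations each id appears in
data.table(mat, logEval = rowSums(mat))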

In actuality, there are 43 observations (Obs) and more than 100,000 comparisons.

Problem: R crashes. Is there a better way to run this without having it crash due to size?

Code snippet:

df1<-data.table(col1=sample(500000:500900,700,replace = T),
                col2=sample(500000:500900,700,replace = T),
                col3=sample(500000:500900,700,replace = T),
                col4=sample(500000:500900,700,replace = T),
                col5 = sample(500000:500900,700,replace = T),
                col6 = sample(500000:500900,700,replace = T),
                col7 = sample(500000:500900,700,replace = T),
                col8 = sample(500000:500900,700,replace = T),
                col9 = sample(500000:500900,700,replace = T),
                col10 = sample(500000:500900,700,replace = T),
                col11 = sample(500000:500900,700,replace = T),
                col12 = sample(500000:500900,700,replace = T),
                col13 = sample(500000:500900,700,replace = T),
                col14 = sample(500000:500900,700,replace = T),
                col15 = sample(500000:500900,700,replace = T),
                col16 = sample(500000:500900,700,replace = T),
                col17 = sample(500000:500900,700,replace = T),
                col18 = sample(500000:500900,700,replace = T),
                col19 = sample(500000:500900,700,replace = T),
                col20 = sample(500000:500900,700,replace = T),
                col21 = sample(500000:500900,700,replace = T),
                col22 = sample(500000:500900,700,replace = T),
                col23 = sample(500000:500900,700,replace = T),
                col24 = sample(500000:500900,700,replace = T),
                col25 = sample(500000:500900,700,replace = T),
                col26 = sample(500000:500900,700,replace = T),
                col27 = sample(500000:500900,700,replace = T),
                col28 = sample(500000:500900,700,replace = T),
                col29 = sample(500000:500900,700,replace = T),
                col30 = sample(500000:500900,700,replace = T),
                col31 = sample(500000:500900,700,replace = T),
                col32 = sample(500000:500900,700,replace = T),
                col33 = sample(500000:500900,700,replace = T),
                col34 = sample(500000:500900,700,replace = T),
                col35 = sample(500000:500900,700,replace = T),
                col36 = sample(500000:500900,700,replace = T),
                col37 = sample(500000:500900,700,replace = T),
                col38 = sample(500000:500900,700,replace = T),
                col39 = sample(500000:500900,700,replace = T),
                col40 = sample(500000:500900,700,replace = T),
                col41 = sample(500000:500900,700,replace = T),
                col42 = sample(500000:500900,700,replace = T),
                col43 = sample(500000:500900,700,replace = T))


#find all ids via table
uniqueIDs<-as.character(unique(unlist(df1)))

#creating adjacency matrix
mat <- sapply(uniqueIDs, function(s) apply(df1, 1, function(x) s %in% x))

#clean-up 
colnames(mat) <- uniqueIDs

rownames(mat) <- paste0("row", seq_len(nrow(df1)))

#transpose so rows are the unique ids and columns are the observations
mat <- data.table(t(mat))

#apply a logical evaluation to count the number of TRUE values per id
mat$logEval <- rowSums(mat == TRUE)
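As an aside (not part of the original snippet): if the membership matrix itself is all you need, it can be built with one hashed %in% call per row instead of one call per (id, row) pair, which avoids the nested sapply/apply blow-up. This is a sketch assuming the df1 and uniqueIDs objects defined above:

m <- as.matrix(df1)
mat <- t(vapply(seq_len(nrow(m)),
                function(i) uniqueIDs %in% as.character(m[i, ]),
                logical(length(uniqueIDs))))
dimnames(mat) <- list(paste0("row", seq_len(nrow(m))), uniqueIDs)

#same logEval as before: number of observations each id appears in
logEval <- colSums(mat)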

I want to make a small update to ensure my overall goal is clear:

  • the dataset has x (43) obs, and each obs has y (200) nbrids.

  • the goal of running the above code is to create an adjacency matrix to identify the nbrids (y) that appear per column. [For example, from the unique nbrids: does y(1) appear in x(i); does y(2); ... does y(900)?]

  • I am not concerned with x, per se. The end goal is:

From the unique ids throughout the matrix, which unique ids appear together, and how often? [This is why I create the logic test to count .n(i)==TRUE.] For those >2, I can filter, as it is likely that such rows share nbrids.

Sample end matrix:

    From       To         Weight
    50012      50056      5
    50012      50032      3
    …
    50063      50090      9

Man, that's a mouthful!
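To restate that goal as code, here is a hedged sketch assuming the df1 defined above (combn() is used as a brute-force restatement of the goal, not as the efficient way to compute it): for every column, list all pairs of distinct ids it contains, then count how often each pair recurs across columns.

library(data.table)

pairs <- rbindlist(lapply(df1, function(x) {
  u <- sort(unique(x))
  if (length(u) < 2) return(NULL)
  as.data.table(t(combn(u, 2)))  #every id pair within this column
}))
setnames(pairs, c("From", "To"))
pairs[, .(Weight = .N), keyby = .(From, To)]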


Solution

  • 2nd Edit:

    These options should get you to the expected output in your edit. Both rely on self-joins to find which combinations occur: the first uses lapply() to self-join one column at a time, while the second melt()s the entire dataset and then self-joins it. For smaller datasets, lapply() is slower, but when I tried 7,000 rows it still came through, whereas the melt-and-self-join created too large a data frame.

    One additional note: this dataset doesn't really have many unique values. If I knew it was sparse, I'd likely add a line to filter out values that are not duplicated anywhere in the dataset; a sketch of that pre-filter follows the first option's output below.

    library(data.table)
    
    # generate data -----------------------------------------------------------
    set.seed(1234)
    dt1<- data.table(replicate(43, sample(500000:500900,700, replace = TRUE)))
    
    rbindlist(
      lapply(dt1
           , function(x) {
             nbrid_dt = data.table(nbrid = unique(x))
    
             nbrid_dt[nbrid_dt
                      , on = .(nbrid < nbrid)
                      , j = .(From = x.nbrid, To = i.nbrid)
                      , nomatch = 0L
                      , allow.cartesian = TRUE]
           }
           )
      )[, .N, keyby = .(From, To)]
    
              From     To  N
         1: 500000 500001 11
         2: 500000 500002 11
         3: 500000 500003  7
         4: 500000 500004  9
         5: 500000 500005 13
        ---                 
    405446: 500897 500899 12
    405447: 500897 500900 10
    405448: 500898 500899 13
    405449: 500898 500900 12
    405450: 500899 500900  9
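    A minimal sketch of that pre-filter (an assumption on my part, not code from the original answer): a value that appears only once in the entire dataset can only ever produce pairs with N = 1, so if you only care about pairs that co-occur in more than one column, such values can be dropped before the self-joins.

    all_vals <- unlist(dt1)
    keep <- unique(all_vals[duplicated(all_vals)])  #values occurring at least twice

    rbindlist(
      lapply(dt1, function(x) {
        #restrict each column's id list to values that recur somewhere
        nbrid_dt <- data.table(nbrid = intersect(unique(x), keep))

        nbrid_dt[nbrid_dt
                 , on = .(nbrid < nbrid)
                 , j = .(From = x.nbrid, To = i.nbrid)
                 , nomatch = 0L
                 , allow.cartesian = TRUE]
      })
    )[, .N, keyby = .(From, To)][N > 1]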
    
    #all at once
    
    molten_dt <- unique(melt(dt1))
    setkey(molten_dt, variable)
    
    molten_dt[molten_dt
              , on = .(value < value
                       ,variable = variable
                        )
              , .(From = x.value, To = i.value)
              , allow.cartesian = TRUE
              , nomatch = 0L
              ][!is.na(From), .N, keyby = .(From, To)]
    

    Original: I'm not fully following, but if you are mainly after the counts across your 43 columns, it may be beneficial to gather / melt the data.

    molten_dt <- melt(dt1)
    
    molten_dt[, N := length(unique(variable)), by = value]
    
          variable  value  N
       1:       V1 500102  9
       2:       V1 500560  8
       3:       V1 500548  9
       4:       V1 500561 12
       5:       V1 500775  9
      ---                                    
    8596:      V43 500096  7
    8597:      V43 500320  6
    8598:      V43 500205 14
    8599:      V43 500711  7
    8600:      V43 500413 11
    
    #or you can aggregate instead of mutate-in-place
    
    molten_dt[, .(N = length(unique(variable))), by = value]
    
          value  N
      1: 500102  9
      2: 500560  8
      3: 500548  9
      4: 500561 12
      5: 500775  9
     ---          
    897: 500753  4
    898: 500759  4
    899: 500816  6
    900: 500772  4
    901: 500446  2
    

    Also, my answer doesn't 100% agree with @Konrad's. When there are duplicated values, there seems to be one additional count in @Konrad's solution.

    Data:

    set.seed(1234)
    dt1<- as.data.table(replicate(43, sample(500000 : 500900, 200, replace = TRUE)))
    #h/t to @Konrad for the quick way to make 43 columns
    

    1st Edit: If you are only interested in the count of each value, you can do the following:

    mat_data <- matrix(replicate(43, sample(500000 : 500900, 700, replace = TRUE)), ncol = 43)
    
    table(unlist(apply(mat_data, 2, unique)))
    

    It's the fastest approach, but the problem is that you lose the information about which column each value came from.

    Unit: milliseconds
               expr     min      lq     mean   median       uq      max neval
     melt_and_count 53.3914 53.8926 57.38576 55.95545 58.55605  79.2055    20
      table_version 11.0566 11.1814 12.24900 11.56760 12.82110  16.4351    20
     vapply_version 63.1623 64.8274 69.86041 67.84505 71.40635 108.2279    20
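    For reference, a hedged reconstruction of how this benchmark could be assembled with the microbenchmark package. The exact expressions timed are assumptions on my part; in particular, vapply_version (whose code never appears above) is a hypothetical stand-in.

    library(microbenchmark)
    library(data.table)

    set.seed(1234)
    mat_data <- matrix(replicate(43, sample(500000:500900, 700, replace = TRUE)), ncol = 43)
    dt1 <- as.data.table(mat_data)

    microbenchmark(
      melt_and_count = melt(dt1, measure.vars = names(dt1))[
        , .(N = length(unique(variable))), by = value],
      table_version  = table(unlist(apply(mat_data, 2, unique))),
      #hypothetical guess at the unseen vapply-based variant
      vapply_version = {
        ids <- unique(as.vector(mat_data))
        vapply(ids,
               function(s) sum(vapply(seq_len(ncol(mat_data)),
                                      function(j) s %in% mat_data[, j],
                                      logical(1))),
               integer(1))
      },
      times = 20
    )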