I am processing a large dataset (after it has been cleaned). The dataset is then processed to create an adjacency matrix, which is passed through a logical evaluation (logEval) to identify the observations that contain each uniqueID.
When running the code snippet that creates the adjacency matrix, the process takes a huge amount of time (and sometimes it just freezes).
Obviously, this is because the function checks each of the unique elements (n = 10,901) and marks TRUE/FALSE if it appears in the observation. An example (greatly reduced):
|Obs_1 |Obs_2 |Obs_3 |Obs_4 |Obs_5 | logEval|
|:-----|:-----|:-----|:-----|:-----|-------:|
|TRUE |FALSE |FALSE |FALSE |FALSE | 1|
|FALSE |TRUE |FALSE |FALSE |FALSE | 1|
|FALSE |FALSE |TRUE |FALSE |FALSE | 1|
|FALSE |FALSE |FALSE |TRUE |FALSE | 1|
|FALSE |FALSE |FALSE |FALSE |TRUE | 1|
|FALSE |FALSE |FALSE |FALSE |TRUE | 1|
|FALSE |FALSE |FALSE |FALSE |FALSE | 0|
|FALSE |FALSE |FALSE |FALSE |FALSE | 0|
|FALSE |FALSE |TRUE |FALSE |FALSE | 1|
|TRUE |FALSE |FALSE |FALSE |FALSE | 1|
|FALSE |FALSE |FALSE |FALSE |TRUE | 1|
|FALSE |FALSE |FALSE |FALSE |FALSE | 0|
|FALSE |FALSE |FALSE |FALSE |FALSE | 0|
In actuality, there are 43 observations (Obs) and more than 100,000 comparisons.
Problem: R crashes. Is there a better way to run this without it crashing due to size?
Code snippet:
library(data.table)
# 43 observation columns, each holding 700 ids sampled from 500000:500900
df1 <- as.data.table(replicate(43, sample(500000:500900, 700, replace = TRUE)))
setnames(df1, paste0("col", 1:43))
#find all unique ids across the table
uniqueIDs <- as.character(unique(unlist(df1)))
#creating adjacency matrix: one column per uniqueID, TRUE if that id appears in the row
mat <- sapply(uniqueIDs, function(s) apply(df1, 1, function(x) s %in% x))
#clean-up
colnames(mat) <- uniqueIDs
rownames(mat) <- paste0("row", seq(nrow(df1)))
mat <- data.table(t(mat))
#apply logical evaluation to count the number of TRUE per row
mat$logEval <- rowSums(mat == TRUE)
I want to make a small update to ensure my overall goal is clear:
- The dataset has x (43) observations, and each observation has y (200) nbrids.
The goal of running the above code is to create an adjacency matrix that identifies which nbrids (y) appear in each column. [For example, from the unique nbrids: does y(1) appear in x(i); does y(2); ... does y(900)?]
I am not concerned with x, per se. The end goal is:
From the unique ids throughout the matrix, which uniqueIDs appear together and how often [this is why I create the logic test to count .n(i)==TRUE]. For those with a count >2, I can filter, as it is likely that such rows share nbrids.
Sample end matrix:
From To Weight
50012 50056 5
50012 50032 3
…
50063 50090 9
Man, that's a mouthful!
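To make the target concrete, here is a tiny self-contained toy (made-up ids, not my real data) that produces output of that shape:
library(data.table)
# toy data: 3 "observations" (columns), each listing a handful of nbrids
toy <- data.table(obs1 = c(50012, 50056, 50032),
                  obs2 = c(50012, 50056, 50090),
                  obs3 = c(50063, 50090, 50012))
# for every observation, enumerate the id pairs it contains,
# then count how many observations share each pair
pairs <- rbindlist(lapply(toy, function(col) {
  ids <- sort(unique(col))
  if (length(ids) < 2) return(NULL)
  as.data.table(t(combn(ids, 2)))   # all within-observation From/To combinations
}))
setnames(pairs, c("From", "To"))
pairs[, .(Weight = .N), keyby = .(From, To)]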
2nd Edit:
These options seem to get to the expected output from your edit. Both rely on self-joins to find which combinations occur. The first option uses lapply() to do the self-join one column at a time, while the second melt()s the data and then self-joins the entire dataset. For smaller datasets, lapply() is slower, but when trying 7,000 rows it still came through, whereas the melt-and-self-join approach created too large a data frame.
One additional note: this dataset doesn't really have many unique values. If I knew it was sparse, I'd likely add a line to filter out values that are not duplicated anywhere in the dataset (see the sketch after the second option below).
library(data.table)
# generate data -----------------------------------------------------------
set.seed(1234)
dt1 <- data.table(replicate(43, sample(500000:500900, 700, replace = TRUE)))
# option 1: column-by-column self-join, then count each From/To pair
rbindlist(
  lapply(dt1
         , function(x) {
           nbrid_dt = data.table(nbrid = unique(x))
           # non-equi self-join: keep only pairs where From < To
           nbrid_dt[nbrid_dt
                    , on = .(nbrid < nbrid)
                    , j = .(From = x.nbrid, To = i.nbrid)
                    , nomatch = 0L
                    , allow.cartesian = TRUE]
         }
  )
)[, .N, keyby = .(From, To)]
From To N
1: 500000 500001 11
2: 500000 500002 11
3: 500000 500003 7
4: 500000 500004 9
5: 500000 500005 13
---
405446: 500897 500899 12
405447: 500897 500900 10
405448: 500898 500899 13
405449: 500898 500900 12
405450: 500899 500900 9
# option 2: melt the whole dataset and self-join it all at once
molten_dt <- unique(melt(dt1))
setkey(molten_dt, variable)
molten_dt[molten_dt
, on = .(value < value
,variable = variable
)
, .(From = x.value, To = i.value)
, allow.cartesian = TRUE
, nomatch = 0L
][!is.na(From), .N, keyby = .(From, To)]
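As noted above, if the dataset were sparse I would pre-filter before the self-join. A minimal sketch of that idea (illustrative only, reusing molten_dt from the second option and applied before joining):
# drop ids that appear in only one column: they can never contribute a From/To pair
keep_ids  <- molten_dt[, uniqueN(variable), by = value][V1 >= 2L, value]
molten_dt <- molten_dt[value %in% keep_ids]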
Original: I'm not fully following, but if you are mainly after the counts across your 43 columns, it may be beneficial to gather / melt the data.
molten_dt <- melt(dt1)
molten_dt[, N := length(unique(variable)), by = value]
variable value N
1: V1 500102 9
2: V1 500560 8
3: V1 500548 9
4: V1 500561 12
5: V1 500775 9
---
8596: V43 500096 7
8597: V43 500320 6
8598: V43 500205 14
8599: V43 500711 7
8600: V43 500413 11
#or you can aggregate instead of mutate-in-place
molten_dt[, .(N = length(unique(variable))), by = value]
value N
1: 500102 9
2: 500560 8
3: 500548 9
4: 500561 12
5: 500775 9
---
897: 500753 4
898: 500759 4
899: 500816 6
900: 500772 4
901: 500446 2
Also, my answer doesn't 100% agree with @Konrad's. When there are duplicated values, there seems to be one additional count in @Konrad's solution.
Data:
set.seed(1234)
dt1<- as.data.table(replicate(43, sample(500000 : 500900, 200, replace = TRUE)))
#h/t to @Konrad for the quick way to make 43 columns
1st Edit: If you are only interested in the count of each value, you can do the following:
mat_data <- matrix(replicate(43, sample(500000 : 500900, 700, replace = TRUE)), ncol = 43)
table(unlist(apply(mat_data, 2, unique)))
It's the fastest approach, but the problem is that you lose the information about which column supplied each value.
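If you'd rather have those counts as a two-column table than a named table object, a quick conversion (a sketch, reusing mat_data from above) is:
# turn the named table() result into a tidy two-column data.table
counts <- as.data.table(table(unlist(apply(mat_data, 2, unique))))
setnames(counts, c("value", "N"))
counts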
Unit: milliseconds
expr min lq mean median uq max neval
melt_and_count 53.3914 53.8926 57.38576 55.95545 58.55605 79.2055 20
table_version 11.0566 11.1814 12.24900 11.56760 12.82110 16.4351 20
vapply_version 63.1623 64.8274 69.86041 67.84505 71.40635 108.2279 20