I am working with data from a meta-analysis of ten studies, where not every study had data for every genetic variant. For each variant I only have the total sample size (N), and I want to work out which studies contributed to that total (and, from there, the combined case and control counts).
Here are examples of what the data looks like:
> head(studies,2)
  study cases controls     N
1     A  3747     8024 11771
2     B  5367     5780 11147
> head(rawresults,2)
          ID      N
1 rs58241367  65280
2 rs85436064 107624
Here is an example of what I want to make from it (the contributing_studies column is optional; I can get rid of it if doing so allows for a better solution):
> head(final,2)
          ID contributing_studies cases controls     N
1 rs10685984            B,C,D,F,G 26221    19987 46208
2 rs12123751            A,C,D,G,J 25631    23509 49140
So far my best idea for how to approach this problem is to brute-force it. Each of the ten studies has two possible states (contributing vs. not contributing) for a given total, so there are 2^10 = 1024 possible combinations and hence at most 1024 distinct sums. Some sums may not be unique (more than one combination of study Ns could produce the same total), and I would plan to exclude those as ambiguous. I have included code below with an example of that solution as an answer.
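For concreteness, a rough outline of that brute-force idea might look something like this (only a sketch with illustrative object names, not the exact code I posted as an answer):

# Enumerate all 2^10 contributing/not-contributing patterns
combos <- as.matrix(expand.grid(rep(list(0:1), nrow(studies))))
# Total N implied by each pattern
combo_sums <- as.vector(combos %*% studies$N)
# Flag totals that more than one pattern can produce (ambiguous)
ambiguous <- combo_sums %in% combo_sums[duplicated(combo_sums)]
# Lookup table from unambiguous totals to the studies that produce them
lookup <- data.frame(
  N = combo_sums[!ambiguous],
  contributing_studies = apply(combos[!ambiguous, , drop = FALSE], 1,
    function(z) paste(studies$study[z == 1], collapse = ",")),
  stringsAsFactors = FALSE
)
final <- rawresults
final$contributing_studies <- lookup$contributing_studies[match(final$N, lookup$N)]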
What I want to ask: Does R have a better way to do this? Perhaps some library or function that exists for handling this kind of problem? Or is there something else I could do to make it faster and more efficient?
Here is code to simulate data for the scenario:
set.seed(1)
# Make "studies"
studies <- data.frame(toupper(letters[1:10]),round(rnorm(10,5000,2000)),round(rnorm(10,5000,2000)),stringsAsFactors=F)
colnames(studies) <- c('study','cases','controls')
studies$N <- studies$cases + studies$controls
# Make "rawresults"
rawresults <- data.frame(character(length=50),numeric(length=50),stringsAsFactors=F)
colnames(rawresults) <- c('ID','N')
for(i in seq(1,50)) {
  numstudies <- sample(seq(5,10),1)
  rawresults[i,'N'] <- sum(sample(studies$N,numstudies))
  rawresults[i,'ID'] <- paste0('rs',sample(seq(1,99999999),1))
}
Edit: Faster code to simulate data for the scenario, so it will be possible to simulate millions of rawresults rows. Inspired by Allan Cameron's solution below, which also uses combn.
set.seed(1)
# Make "studies"
studies <- data.frame(toupper(letters[1:10]),round(rnorm(10,5000,2000)),round(rnorm(10,5000,2000)),stringsAsFactors=F)
colnames(studies) <- c('study','cases','controls')
studies$N <- studies$cases + studies$controls
# Make "rawresults"
num_results <- 50 # Number of results to simulate
possible_ns <- unlist(sapply(1:10,combn,x=studies$N,sum))
rawresults <- data.frame(paste0('rs',sample(1:99999999,num_results)),sample(possible_ns,num_results,rep=T),stringsAsFactors=F)
colnames(rawresults) <- c('ID','N')
I think there is certainly a way of making the code a bit shorter by using the combn function:
# Sums of N for every combination of x studies out of 10
getSum <- function(x) colSums(matrix(studies$N[combn(1:10, x)], nrow = x))
# Matching labels: the study indices that make up each combination
getInd <- function(x) apply(combn(1:10, x), 2, function(y) paste(y, collapse = ", "))
# All possible totals and their contributing study indices, then look up each observed N
all_sums <- do.call(c, lapply(1:10, getSum))
all_inds <- do.call(c, lapply(1:10, getInd))
rawresults$studies <- all_inds[match(rawresults$N, all_sums)]
This yields:
rawresults
#> ID N studies
#> 1 rs58241367 65280 2, 4, 6, 7, 8, 9
#> 2 rs85436064 107624 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
#> 3 rs64407295 107624 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
#> 4 rs78593369 83683 1, 3, 4, 5, 6, 7, 8, 10
#> 5 rs18774630 107624 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
#> 6 rs99681114 49670 3, 6, 7, 9, 10
#> 7 rs8426283 69694 2, 3, 4, 5, 6, 7, 10
#> 8 rs81116968 75972 2, 4, 5, 6, 7, 8, 10
#> 9 rs54871836 55138 1, 3, 5, 9, 10
#> 10 rs21386862 87919 1, 2, 3, 5, 6, 8, 9, 10
#> 11 rs16179951 73195 1, 2, 3, 4, 6, 8, 9
#> 12 rs8492848 74843 1, 3, 4, 5, 7, 9, 10
#> 13 rs81342050 56555 2, 4, 5, 7, 9
#> 14 rs44945811 59794 1, 2, 3, 6, 7, 10
#> 15 rs43997413 94715 1, 2, 3, 4, 6, 7, 8, 9, 10
#> 16 rs97055078 94715 1, 2, 3, 4, 6, 7, 8, 9, 10
#> 17 rs19941161 51106 1, 3, 4, 5, 10
#> 18 rs30748262 94259 1, 2, 3, 4, 5, 6, 7, 9, 10
#> 19 rs10959349 64914 2, 4, 6, 8, 9, 10
#> 20 rs982457 99355 1, 2, 3, 4, 5, 7, 8, 9, 10
#> 21 rs2007022 107624 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
#> 22 rs21833759 100537 1, 2, 4, 5, 6, 7, 8, 9, 10
#> 23 rs74715222 87172 1, 2, 4, 5, 6, 7, 9, 10
#> 24 rs49182929 54248 4, 5, 6, 7, 8
#> 25 rs95501056 64548 1, 2, 3, 5, 6, 8
#> 26 rs57556390 53270 2, 3, 4, 5, 8
#> 27 rs98150573 107624 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
#> 28 rs12123751 49140 1, 3, 4, 7, 10
#> 29 rs97971902 107624 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
#> 30 rs89722202 50848 2, 3, 4, 5, 7
#> 31 rs98543720 65904 1, 4, 6, 7, 8, 9
#> 32 rs31961269 70318 1, 3, 4, 5, 6, 7, 10
#> 33 rs9764985 107624 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
#> 34 rs10685984 46208 2, 3, 4, 6, 7
#> 35 rs34912847 77631 1, 3, 4, 5, 7, 8, 9
#> 36 rs94227949 76514 2, 3, 5, 6, 7, 8, 9
#> 37 rs74751789 67467 1, 2, 5, 6, 9, 10
#> 38 rs47441925 66565 1, 2, 4, 7, 8, 10
#> 39 rs4502074 69920 2, 4, 5, 7, 8, 9
#> 40 rs2741222 57384 1, 4, 5, 8, 10
#> 41 rs11561555 79017 1, 2, 4, 5, 6, 8, 9
#> 42 rs96740802 85900 1, 3, 4, 5, 6, 7, 8, 9
#> 43 rs7081648 96681 1, 2, 3, 4, 5, 6, 8, 9, 10
#> 44 rs68842412 74957 1, 3, 4, 5, 6, 8, 9
#> 45 rs78557028 107624 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
#> 46 rs75435758 84706 3, 4, 5, 6, 7, 8, 9, 10
#> 47 rs58940608 58751 2, 3, 4, 5, 6, 10
#> 48 rs25060056 66639 2, 5, 6, 7, 9, 10
#> 49 rs98833089 99355 1, 2, 3, 4, 5, 7, 8, 9, 10
#> 50 rs70239332 94259 1, 2, 3, 4, 5, 6, 7, 9, 10
This takes under 10 milliseconds to find all the combinations of N and generate the index strings. That work is only done once; after that, performance comes down to the efficiency of the match function, which is about as good as R offers for this kind of lookup. The whole thing runs in about 11 ms on my modest machine.
Created on 2020-06-29 by the reprex package (v0.3.0)
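To go from the matched index strings all the way to the final table in the question (study letters plus combined case and control counts), the same combn trick can be reused on the cases and controls columns. A sketch, assuming the studies data frame from the question is in scope (getLetters, getCases and getControls are just illustrative helper names):

getCases    <- function(x) colSums(matrix(studies$cases[combn(1:10, x)], nrow = x))
getControls <- function(x) colSums(matrix(studies$controls[combn(1:10, x)], nrow = x))
getLetters  <- function(x) apply(combn(1:10, x), 2,
                                 function(y) paste(studies$study[y], collapse = ","))

idx <- match(rawresults$N, all_sums)
final <- data.frame(ID = rawresults$ID,
                    contributing_studies = do.call(c, lapply(1:10, getLetters))[idx],
                    cases    = do.call(c, lapply(1:10, getCases))[idx],
                    controls = do.call(c, lapply(1:10, getControls))[idx],
                    N        = rawresults$N,
                    stringsAsFactors = FALSE)

One caveat: match returns the first matching combination, so a total that more than one set of studies could produce is silently assigned to one of them. If that matters, ambiguous totals can be flagged first (e.g. with all_sums %in% all_sums[duplicated(all_sums)]) and the corresponding rows set to NA.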