Search code examples
rsumdeconvolution

Does R have a way to figure out which numbers contributed to a sum? (deconvolution?)


I am working with data from a meta-analysis of studies, where not every study had data for every genetic variant. I am trying to solve the following problem:

  • I have a data frame ("studies") with information about 10 studies, which includes the number of cases and controls in each study and the total study sample size (N = cases + controls).
  • I also have a data frame ("rawresults") with approximately 10 million rows, each of which was generated by summing some unknown subset of the sample sizes (N) from the table of studies.
  • I need to figure out which studies were added up to create each total

Here are examples of what the data looks like:

> head(studies,2)
  study cases controls     N
1     A  3747     8024 11771
2     B  5367     5780 11147

> head(rawresults,2)
          ID      N
1 rs58241367  65280
2 rs85436064 107624

Here is an example of what I want to make from it (the contributing_studies column in optional, I can get rid of it if doing so allows for a better solution):

> head(final,2)
          ID contributing_studies cases controls     N
1 rs10685984            B,C,D,F,G 26221    19987 46208
2 rs12123751            A,C,D,G,J 25631    23509 49140

So far my best idea for how to approach this problem is to brute-force it. Each of the ten studies has two possible states (contributing vs. not contributing) for a given total, so that's 2^10 = 1024 possible sums. Some sums may not be unique (there may be more than one combination of study Ns that could create that sum) and I would plan to exclude those as ambiguous. I have included code below with an example of that solution as an answer.

What I want to ask: Does R have a better way to do this? Perhaps some library or function that exists for handling this kind of problem? Or is there something else I could do to make it faster and more efficient?

Here is code to simulate data for the scenario:

set.seed(1)

# Make "studies"
studies <- data.frame(toupper(letters[1:10]),round(rnorm(10,5000,2000)),round(rnorm(10,5000,2000)),stringsAsFactors=F)
colnames(studies) <- c('study','cases','controls')
studies$N <- studies$cases + studies$controls

# Make "rawresults"
rawresults <- data.frame(character(length=50),numeric(length=50),stringsAsFactors=F)
colnames(rawresults) <- c('ID','N')
for(i in seq(1,50)) {
  numstudies <- sample(seq(5,10),1)
  rawresults[i,'N'] <- sum(sample(studies$N,numstudies))
  rawresults[i,'ID'] <- paste0('rs',sample(seq(1,99999999),1))
}

Edit: Faster code to simulate data for the scenario, so it will be possible to simulate millions of rawresults rows. Inspired by Allan Cameron's solution below, which also uses combn.

set.seed(1)

# Make "studies"
studies <- data.frame(toupper(letters[1:10]),round(rnorm(10,5000,2000)),round(rnorm(10,5000,2000)),stringsAsFactors=F)
colnames(studies) <- c('study','cases','controls')
studies$N <- studies$cases + studies$controls

# Make "rawresults"
num_results <- 50 # Number of results to simulate
possible_ns <- unlist(sapply(1:10,combn,x=studies$N,sum))
rawresults <- data.frame(paste0('rs',sample(1:99999999,num_results)),sample(possible_ns,num_results,rep=T),stringsAsFactors=F)
colnames(rawresults) <- c('ID','N')

Solution

  • I think there is certainly a way of making the code a bit shorter by using the combn function:

    getSum <- function(x) colSums(matrix(studies$N[combn(1:10, x)], nrow = x))
    getInd <- function(x) apply(combn(1:10, x), 2, function(y) paste(y, collapse = ", "))
    all_sums <- do.call(c, lapply(1:10, getSum))
    all_inds <- do.call(c, lapply(1:10, getInd))
    rawresults$studies <- all_inds[match(rawresults$N, all_sums)]
    

    This yields:

    rawresults
    #>            ID      N                       studies
    #> 1  rs58241367  65280              2, 4, 6, 7, 8, 9
    #> 2  rs85436064 107624 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
    #> 3  rs64407295 107624 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
    #> 4  rs78593369  83683       1, 3, 4, 5, 6, 7, 8, 10
    #> 5  rs18774630 107624 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
    #> 6  rs99681114  49670                3, 6, 7, 9, 10
    #> 7   rs8426283  69694          2, 3, 4, 5, 6, 7, 10
    #> 8  rs81116968  75972          2, 4, 5, 6, 7, 8, 10
    #> 9  rs54871836  55138                1, 3, 5, 9, 10
    #> 10 rs21386862  87919       1, 2, 3, 5, 6, 8, 9, 10
    #> 11 rs16179951  73195           1, 2, 3, 4, 6, 8, 9
    #> 12  rs8492848  74843          1, 3, 4, 5, 7, 9, 10
    #> 13 rs81342050  56555                 2, 4, 5, 7, 9
    #> 14 rs44945811  59794             1, 2, 3, 6, 7, 10
    #> 15 rs43997413  94715    1, 2, 3, 4, 6, 7, 8, 9, 10
    #> 16 rs97055078  94715    1, 2, 3, 4, 6, 7, 8, 9, 10
    #> 17 rs19941161  51106                1, 3, 4, 5, 10
    #> 18 rs30748262  94259    1, 2, 3, 4, 5, 6, 7, 9, 10
    #> 19 rs10959349  64914             2, 4, 6, 8, 9, 10
    #> 20   rs982457  99355    1, 2, 3, 4, 5, 7, 8, 9, 10
    #> 21  rs2007022 107624 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
    #> 22 rs21833759 100537    1, 2, 4, 5, 6, 7, 8, 9, 10
    #> 23 rs74715222  87172       1, 2, 4, 5, 6, 7, 9, 10
    #> 24 rs49182929  54248                 4, 5, 6, 7, 8
    #> 25 rs95501056  64548              1, 2, 3, 5, 6, 8
    #> 26 rs57556390  53270                 2, 3, 4, 5, 8
    #> 27 rs98150573 107624 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
    #> 28 rs12123751  49140                1, 3, 4, 7, 10
    #> 29 rs97971902 107624 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
    #> 30 rs89722202  50848                 2, 3, 4, 5, 7
    #> 31 rs98543720  65904              1, 4, 6, 7, 8, 9
    #> 32 rs31961269  70318          1, 3, 4, 5, 6, 7, 10
    #> 33  rs9764985 107624 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
    #> 34 rs10685984  46208                 2, 3, 4, 6, 7
    #> 35 rs34912847  77631           1, 3, 4, 5, 7, 8, 9
    #> 36 rs94227949  76514           2, 3, 5, 6, 7, 8, 9
    #> 37 rs74751789  67467             1, 2, 5, 6, 9, 10
    #> 38 rs47441925  66565             1, 2, 4, 7, 8, 10
    #> 39  rs4502074  69920              2, 4, 5, 7, 8, 9
    #> 40  rs2741222  57384                1, 4, 5, 8, 10
    #> 41 rs11561555  79017           1, 2, 4, 5, 6, 8, 9
    #> 42 rs96740802  85900        1, 3, 4, 5, 6, 7, 8, 9
    #> 43  rs7081648  96681    1, 2, 3, 4, 5, 6, 8, 9, 10
    #> 44 rs68842412  74957           1, 3, 4, 5, 6, 8, 9
    #> 45 rs78557028 107624 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
    #> 46 rs75435758  84706       3, 4, 5, 6, 7, 8, 9, 10
    #> 47 rs58940608  58751             2, 3, 4, 5, 6, 10
    #> 48 rs25060056  66639             2, 5, 6, 7, 9, 10
    #> 49 rs98833089  99355    1, 2, 3, 4, 5, 7, 8, 9, 10
    #> 50 rs70239332  94259    1, 2, 3, 4, 5, 6, 7, 9, 10
    

    This takes under 10 milliseconds to find all the combinations of N and generate the index strings. It only does this once, then the efficiency is just down to the efficiency of the match function, which is about as good as R offers for this kind of algorithm. The whole thing runs in about 11ms on my modest machine.

    Created on 2020-06-29 by the reprex package (v0.3.0)