Search code examples
rstatisticsprobabilityprobability-density

Error probability function


I have DNA amplicons with base mismatches which can arise during the PCR amplification process. My interest is, what is the probability that a sequence contains errors, given the error rate per base, number of mismatches and the number of bases in the amplicon.

I came across an article [Cummings, S. M. et al (2010). Solutions for PCR, cloning and sequencing errors in population genetic analysis. Conservation Genetics, 11(3), 1095–1097. doi:10.1007/s10592-009-9864-6] that proposes this formula to calculate the probability mass function in such cases.

enter image description here

I implemented the formula with R as shown here

pcr.prob <- function(k,N,eps){
  v = numeric(k)
  for(i in 1:k) {
    v[i] = choose(N,k-i) * (eps^(k-i)) * (1 - eps)^(N-(k-i))
    }
  1 - sum(v)
}

From the article, suggest we analysed an 800 bp amplicon using a PCR of 30 cycles with 1.85e10-5 misincorporations per base per cycle, and found 10 unique sequences that are each 3 bp different from their most similar sequence. The probability that a novel sequences was generated by three independent PCR errors equals P = 0.0011.

However when I use my implementation of the formula I get a different value.

pcr.prob(3,800,0.0000185)
[1] 5.323567e-07

What could I be doing wrong in my implementation? Am I misinterpreting something?

Thanks


Solution

  • I think they've got the right number (0.00113), but badly explained in their paper.

    The calculation you want to be doing is:

    pbinom(3, 800, 1-(1-1.85e-5)^30, lower=FALSE)
    

    I.e. what's the probability of seeing less than three modifications in 800 independent bases, given 30 amplifications that each have a 1.85e-5 chance of going wrong. I.e. you're calculating the probability it doesn't stay correct 30 times.

    Somewhat statsy, may be worth a move…

    Thinking about this more, you will start to see floating-point inaccuracies when working with very small probabilities here. I.e. a 1-x where x is a small number will start to go wrong when the absolute value of x is less than about 1e-10. Working with log-probabilities is a good idea at this point, specifically the log1p function is a great help. Using:

    pbinom(3, 800, 1-exp(log1p(-1.85e-5)*30), lower=FALSE)
    

    will continue to work even when the error incorporation rate is very low.