Search code examples
rprobabilityset-intersectionset-union

Probability of the Union of Three or More Sets


Consider the following sets of probabilities (the three events are NOT mutually exclusive):

  • 0.05625 success, 0.94375 failure
  • 0.05625 success, 0.94375 failure
  • 0.05625 success, 0.94375 failure

How do I compute the probability of at least one event happening (i.e. the union)?

If possible I would prefer a generic, self-contained solution that could also handle 4 or more events. In this case the answer I'm looking for is:

0.05625 + 0.05625 + 0.05625 -
0.05625*0.05625 - 0.05625*0.05625 - 0.05625*0.05625 +
0.05625*0.05625*0.05625
##[1] 0.1594358

My question is ultimately a bit broader than the title, as I'm looking for functions that can compute the probability of union, intersection (0.05625*0.05625*0.05625 = 0.0001779785), no event happening (1 - 0.1594358 = 0.8405642) or exactly one event happening (0.150300). In other words an R solution to this on-line Conjunction of three events calculator. I've already looked into the prob package, but it seems to have an interface too complicated for such a simplistic use-case.


Solution

  • Equal Probabilities

    You can get the probability of exactly 0, 1, 2, or 3 of these occurring with the binomial density function dbinom, which returns the probability of getting exactly the number of specified successes (first argument) given the total number of independent attempts (second argument) and the probability of success for each attempt (third argument):

    dbinom(0:3, 3, 0.05625)
    # [1] 0.8405642090 0.1502995605 0.0089582520 0.0001779785
    

    So if you want the probability of at least one happening, that is:

    sum(dbinom(1:3, 3, 0.05625))
    # [1] 0.1594358
    

    or

    1 - dbinom(0, 3, 0.05625)
    # [1] 0.1594358
    

    The dbinom function can address your other questions as well. For instance, the probability of all happening is:

    dbinom(3, 3, 0.05625)
    # [1] 0.0001779785
    

    The probability of exactly one is:

    dbinom(1, 3, 0.05625)
    # [1] 0.1502996
    

    The probability of none is:

    dbinom(0, 3, 0.05625)
    # [1] 0.8405642
    

    Unequal Probabilities -- Some Easy Cases

    If you have unequal probabilities stored in a vector p and each item is selected independently, you need to do a bit more work, because the dbinom function doesn't apply. Still, some of the computations are pretty simple.

    The probability of none of the items being selected is simply the product of 1 minus the probabilities (the probability that at least one is selected is simply one minus this):

    p <- c(0.1, 0.2, 0.3)
    prod(1-p)
    # [1] 0.504
    

    The probability of all is the product of the probabilities:

    prod(p)
    # [1] 0.006
    

    Finally, the probability of exactly one being selected is the sum across all the elements of its probability times the probability of all the other elements not being selected:

    sum(p * (prod(1-p) / (1-p)))
    # [1] 0.398
    

    Similarly, the probability of exactly n-1 being selected (where n is the number of probabilities) is:

    sum((1-p) * (prod(p) / p))
    # [1] 0.092
    

    Unequal Probabilities -- Complete Case

    If you want the probability of every single possible count of successes, one option could be computing all 2^n event combinations (which is what A. Webb does in their answer). Instead, the following is a O(n^2) scheme:

    cp.quadratic <- function(p) {
      P <- matrix(0, nrow=length(p), ncol=length(p))
      P[1,] <- rev(cumsum(rev(p * prod(1-p) / (1-p))))
      for (i in seq(2, length(p))) {
        P[i,] <- c(rev(cumsum(rev(head(p, -1) / (1-head(p, -1)) * tail(P[i-1,], -1)))), 0)
      }
      c(prod(1-p), P[,1])
    }
    cp.quadratic(c(0.1, 0.2, 0.3))
    # [1] 0.504 0.398 0.092 0.006
    

    Basically, we define P_ij to be the probability that we have exactly i successes, all of which are in position j or greater. Base cases for i=0 and i=1 are relatively simple to compute, and then we have the following recurrence:

    P_ij = P_i(j+1) + p_j / (1-p_j) * P_(i-1)(j+1)
    

    In the function cp.quadratic, we loop with increasing i, filling out the P matrix (which is n x n). Therefore the total operation count is O(n^2).

    This enables you, for instance, to compute the distribution for pretty large numbers of options in under a second:

    system.time(cp.quadratic(sample(c(.1, .2, .3), 100, replace=T)))
    #    user  system elapsed 
    #   0.005   0.000   0.006 
    system.time(cp.quadratic(sample(c(.1, .2, .3), 1000, replace=T)))
    #    user  system elapsed 
    #   0.165   0.043   0.208 
    system.time(cp.quadratic(sample(c(.1, .2, .3), 10000, replace=T)))
    #    user  system elapsed 
    #  12.721   3.161  16.567 
    

    We can compute the distribution from 1,000 elements in a fraction of a second and from 10,000 elements in under a minute; computing 2^1000 or 2^10000 possible outcomes would take a prohibitively long time (the number of subsets are a 301-digit and 3010-digit number, respectively).