r probability set-intersection set-union

Probability of the Union of Three or More Sets

Consider the following sets of probabilities (the three events are NOT mutually exclusive):

0.05625 success, 0.94375 failure
0.05625 success, 0.94375 failure
0.05625 success, 0.94375 failure

How do I compute the probability of at least one event happening (i.e. the union)?

If possible I would prefer a generic, self-contained solution that could also handle 4 or more events. In this case the answer I'm looking for is:

0.05625 + 0.05625 + 0.05625 -
0.05625*0.05625 - 0.05625*0.05625 - 0.05625*0.05625 +
0.05625*0.05625*0.05625
##[1] 0.1594358

My question is ultimately a bit broader than the title, as I'm looking for functions that can compute the probability of union, intersection (0.05625*0.05625*0.05625 = 0.0001779785), no event happening (1 - 0.1594358 = 0.8405642) or exactly one event happening (0.150300). In other words an R solution to this on-line Conjunction of three events calculator. I've already looked into the prob package, but it seems to have an interface too complicated for such a simplistic use-case.

Solution

Equal Probabilities

You can get the probability of exactly 0, 1, 2, or 3 of these occurring with the binomial density function dbinom, which returns the probability of getting exactly the number of specified successes (first argument) given the total number of independent attempts (second argument) and the probability of success for each attempt (third argument):

dbinom(0:3, 3, 0.05625)
# [1] 0.8405642090 0.1502995605 0.0089582520 0.0001779785

So if you want the probability of at least one happening, that is:

sum(dbinom(1:3, 3, 0.05625))
# [1] 0.1594358

1 - dbinom(0, 3, 0.05625)
# [1] 0.1594358

The dbinom function can address your other questions as well. For instance, the probability of all happening is:

dbinom(3, 3, 0.05625)
# [1] 0.0001779785

The probability of exactly one is:

dbinom(1, 3, 0.05625)
# [1] 0.1502996

The probability of none is:

dbinom(0, 3, 0.05625)
# [1] 0.8405642

Unequal Probabilities -- Some Easy Cases

If you have unequal probabilities stored in a vector p and each item is selected independently, you need to do a bit more work, because the dbinom function doesn't apply. Still, some of the computations are pretty simple.

The probability of none of the items being selected is simply the product of 1 minus the probabilities (the probability that at least one is selected is simply one minus this):

p <- c(0.1, 0.2, 0.3)
prod(1-p)
# [1] 0.504

The probability of all is the product of the probabilities:

prod(p)
# [1] 0.006

Finally, the probability of exactly one being selected is the sum across all the elements of its probability times the probability of all the other elements not being selected:

sum(p * (prod(1-p) / (1-p)))
# [1] 0.398

Similarly, the probability of exactly n-1 being selected (where n is the number of probabilities) is:

sum((1-p) * (prod(p) / p))
# [1] 0.092

Unequal Probabilities -- Complete Case

If you want the probability of every single possible count of successes, one option could be computing all 2^n event combinations (which is what A. Webb does in their answer). Instead, the following is a O(n^2) scheme:

cp.quadratic <- function(p) {
  P <- matrix(0, nrow=length(p), ncol=length(p))
  P[1,] <- rev(cumsum(rev(p * prod(1-p) / (1-p))))
  for (i in seq(2, length(p))) {
    P[i,] <- c(rev(cumsum(rev(head(p, -1) / (1-head(p, -1)) * tail(P[i-1,], -1)))), 0)
  }
  c(prod(1-p), P[,1])
}
cp.quadratic(c(0.1, 0.2, 0.3))
# [1] 0.504 0.398 0.092 0.006

Basically, we define P_ij to be the probability that we have exactly i successes, all of which are in position j or greater. Base cases for i=0 and i=1 are relatively simple to compute, and then we have the following recurrence:

P_ij = P_i(j+1) + p_j / (1-p_j) * P_(i-1)(j+1)

In the function cp.quadratic, we loop with increasing i, filling out the P matrix (which is n x n). Therefore the total operation count is O(n^2).

This enables you, for instance, to compute the distribution for pretty large numbers of options in under a second:

system.time(cp.quadratic(sample(c(.1, .2, .3), 100, replace=T)))
#    user  system elapsed 
#   0.005   0.000   0.006 
system.time(cp.quadratic(sample(c(.1, .2, .3), 1000, replace=T)))
#    user  system elapsed 
#   0.165   0.043   0.208 
system.time(cp.quadratic(sample(c(.1, .2, .3), 10000, replace=T)))
#    user  system elapsed 
#  12.721   3.161  16.567

We can compute the distribution from 1,000 elements in a fraction of a second and from 10,000 elements in under a minute; computing 2^1000 or 2^10000 possible outcomes would take a prohibitively long time (the number of subsets are a 301-digit and 3010-digit number, respectively).