Search code examples
rrandomexponential-distribution

how to generate random numbers (probabilities) from exponential distribution that sum up to 1


Consider I want x random numbers that sum up to one and that distribution is exponential. When I use

x<-c(10,100,1000)

a<-rexp(x[3],rate=1)

a<-a/sum(a)

This will change the distribution, right?

So does anybody know a way so that the probabilities are still exponential distributed? I know that they will then not be completely independent anymore.

Thanks a lot!


Solution

  • Yes, normalization changes the distribution and, in fact, it is not possible to achieve precisely what you want.


    Straightforward proof

    Let X1, …, Xn for some finite n be random variables whose values you want to generate. The two requirements that you have are

    1. Xi~Exp(λ) for some λ>0 and i=1,…,n.
    2. X1+…+Xn=1.

    While each of the two individual requirements are easy to fulfil, it’s not possible to have both at the same time. The reason for that is that the probability density function of an exponential distribution is positive on [0,∞). That means that each Xi attains values greater than 1 with positive probability, which means that the requirement 2 does not always hold. In fact, it holds with zero probability.


    Probability distribution implied by the normalization

    Now you propose an intuitive approach to start with the requirement 1 and to perform normalization Zi = Xi / (X1+…+Xn) for each i=1,…,n. However, few distributions behave nicely under transformations such as addition, multiplication, and particularly division, because random denominators are rarely tractable. In this case, we have additional complication that the numerator and denominator of Zi are dependent.

    Nevertheless, the name of the exact distribution of Zi is actually known and it is the Dirichlet distribution. To see that, note that Xi~Gamma(1,λ), where λ acts as the rate parameter. Next, we look at a definition of the Dirichlet distribution: we start with Yi~Gamma(αi, θ) for i=1,…,n and then, just like you propose, define Wi=Yi / (Y1+…+Yn). Then (W1,…,Wn)~Dirichlet(αi,…,αn). However, in the case of requirement 1, we have that αi=1 for each i=1,…,n. Thus, your approach leads to (Z1,…,Zn)~Dirichlet(1,…,1).

    You may then simulate values from it using, e.g., the MCMCpack package:

    library(MCMCpack)
    rdirichlet(1, c(1, 1, 1))
    #           [,1]      [,2]       [,3]
    # [1,] 0.2088649 0.7444334 0.04670173
    sum(rdirichlet(1, c(1, 1, 1)))
    # [1] 1
    

    Now looking at the probability density function of Dirichlet(1,...,1) you can notice that it actually is constant (when positive). So, in a way you may see it as a multivariate uniform one. It makes sense if you think about it for a second (e.g., think if points on x+y=1, x+y+z=1).

    The multivariate distribution being somewhat uniform, however, does not imply something similar in terms of marginal distributions. In fact, one can show that they are Beta(1, n-1).

    On Zi being restricted to [0,1]

    Since, for certain values of λ, exponential random variables are concentrated close to zero, one may falsely think that they actually have a bounded support.

    The cumulative distribution function of Xi~Exp(λ) is 1-exp(-λx). So then P(Xi<=1)=1-exp(-λ) which is 1 only in the limit as λ->∞, but in that case X converges to 0 in distribution. So, we cannot have a nondegenerate exponential random variable restricted to [0,1]. Notice, though, that for large fixed values of λ 1-exp(-λ) is close to 1 and one may falsely think that Xi is actually restricted to [0,1].

    A couple of trivial demonstrations. First, Zi (following the Dirichlet distribution) are restricted to [0,1].

    data <- replicate({
      x <- rexp(5)
      z <- x[1] / sum(x)}, n = 100000)
    range(data)
    # [1] 1.060492e-06 9.633081e-01
    plot(density(data, bw = 0.01))
    

    enter image description here

    Second, X~Exp(1) clearly takes values above 1.

    x <- rexp(10000)
    range(x)
    # [1] 7.737341e-05 1.005980e+01
    mean(x < 1)
    # [1] 0.6391
    plot(density(x))
    

    enter image description here


    Scaling by a positive factor

    There were multiple comments proposing to use the fact that the exponential distribution is closed under scaling by a positive factor so that if X ~ Exp(λ), then kX ~ Exp(λ/k). That certainly is true, but it’s not applicable in the current case. The reason is that k = X1+…+Xn is not a constant (meaning that k is different for different realizations of Xi) and, for this reason, kX ~ Exp(λ/k) does not hold. Now if we considered k as a constant (e.g., 5), then there would be no guarantee that Zi = Xi / 5 will satisfy your requirement 2. In fact, the constraint would hold with probability 0.

    To have a clear understanding of what's happening and not to be misled by empirical "proofs" of @MauritsEvers, here are some more details.

    Let (Ω,F,P) be the probability space. Then Xi:Ω->R; i.e., Xi is a function taking values Xi(ω) in R, with outcomes ω (imagine them as set.seed values) from Ω. Now we indeed have this property that, for a constant k, kXi~Exp(λ/k). By constant, however, it is meant that regardless of the realized outcome ω from Ω, the value of k is always the same, as if k:Ω->R was a constant function. What @MauritsEvers proposes is k = X1+…+Xn. That, however, seen as a function, is not constant, and depends on the outcome ω.

    Some trivial examples demonstrating how this logic fails are the following: let k=1/Xi. Then kXi=1, which is a degenerate random variable, not an exponential one. Similarly, if X~N(0,1), then kX=1 and not kX~N(0,1/X^2), which would "follow" from the fact that X~N(0,1) gives kX ~ N(0,k^2) for a constant k.


    Erroneous logic

    Now the origin of this erroneous logic described above could be said to be mishandling probabilistic concepts + dealing directly with simulated values in R. @MauritsEvers claims that if we run

    n <- 3
    x <- rexp(n)
    k <- sum(x)
    

    then the realized sum k can be used as the constant k mentioned above and to expect that kXi~Exp(?). Sanity check of taking n <- 1, as in the example above, already shows that there is something wrong with this kind of argument since then x / k is simply 1 — a degenerate random variable, not an exponential one. It is claimed that k <- sum(x) is a valid choice because it is a number of already observed realizations. That actually is the reason why this choice is invalid. In the notation from before, we have k(ω) = X1(ω)+…+Xn(ω) so that k is not a constant function.

    Another way to look at it is that, if we see x as somehow random, then k is just as random as it is the sum of x. Now both x and k are numbers, realizations, but we know neither of their values before we ask R to print them. A definition of a constant k would be that we always know its value, regardless of ω or set.seed.

    Lastly, as an undergraduate exercise, one may consider looking at the CDF of kXi:

    P(kXi <= x) = P(Xi <= x/k) = 1-exp(-λx/k)

    and hence kXi~Exp(λ/k), as expected. Now take n <- 2. In that case we are dealing with

    P(X1 / (X1 + X2) <= x)

    and we no longer can get rid of the complicated denominator so easily. Surely we can define a constant k = X1(ω)+…+Xn(ω) for some fixed ω from Ω. But then Zi = Xi / (X1(ω)+…+Xn(ω)) are no longer restricted to [0,1] and requirement 2 fails again.


    False empirical "proofs"

    Lastly, one may ask, why does the empirical “proof” of @MauritsEvers partially (since simulation + fitting + hypothesis testing is far from a theoretical proof) claims that Zi actually do follow an exponential distribution.

    A crucial element of this “proof” was to take lambda <- 1 and n <- 1000, a relatively large value. In that case we have that

    Zi = Xi/(X1+…+Xn) ≈ Xi / n * n / (X1+…+Xn).

    The second term on the right hand side, by the law of large numbers, goes to λ — a fixed number — while the first term follows, as we know, Exp(λn). So, for a large n we get an approximation of Zi as λExp(λn). However, the original question is not about approximations or limiting distributions.


    Summary

    We may distinguish the following three cases:

    1. Small n. (Z1, …, Zn) follows the Dirichlet(1,…,1) distribution and the marginal distributions are not equivalent to the exponential ones. Approximating them by exponential ones gives arbitrary poor results.
    2. Large n. (Z1, …, Zn) still follows the Dirichlet(1,…,1) distribution and the marginal distributions still are not equivalent to the exponential ones. Approximating them by exponential ones should give perfectly valid results for practical purposes, however.
    3. Limiting case when n->∞. As n grows, each Zi gets closer and closer to λExp(λn). However, as we saw, λExp(λn) tends to a degenerate random variable identically equal to zero.