Tags: python, random, scipy, simulation, normal-distribution

Scipy: sample lognormal distribution from tail only


I'm building a simulation which requires random draws from the tail of a lognormal distribution. A threshold τ (tau) is chosen, and the resulting conditional distribution is given by:

    F_τ(x) = (F(x) - F(τ)) / (1 - F(τ)),   for x ≥ τ

I need to randomly sample from that conditional distribution, where F(x) is lognormal with a chosen µ (mu) and σ (sigma), and τ (tau) is set by the user.

My inelegant solution right now is simply to sample from the lognormal, tossing out any values under τ (tau), until I have the sample size I need. But I'm sure this can be improved.
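
For concreteness, that filter-and-retry approach might look like the following sketch (the function name and the `rng` handling are my own choices, not from any library):

```python
import numpy as np

def naiveTailSamples(ranCount, mu, sigma, tau, rng=None):
    """Naive rejection: draw lognormal variates, keep only those >= tau."""
    rng = np.random.default_rng() if rng is None else rng
    samples = np.empty(0)
    while samples.size < ranCount:
        draws = rng.lognormal(mu, sigma, size=ranCount)           # fresh batch of lognormal draws
        samples = np.concatenate([samples, draws[draws >= tau]])  # keep tail values only
    return samples[:ranCount]
```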

Thanks for the help!


Solution

  • The easiest way is probably to leverage the truncated normal distribution as provided by SciPy.

    This gives the following code, with ν (nu) as the variable of the standard Gaussian distribution, and τ (tau) mapping to ν0 on that distribution. The function returns a NumPy array containing ranCount lognormal variates:

    import math
    import numpy as np
    from scipy.stats import truncnorm
    
    def getMySamplesScipy(ranCount, mu, sigma, tau):
        nu0 = (math.log(tau) - mu) / sigma              # position of tau on the standard Gaussian
        xs = truncnorm.rvs(nu0, np.inf, size=ranCount)  # truncated standard-normal samples
        ys = np.exp(mu + sigma * xs)                    # go back to x space
        return ys
    

    If for some reason this is not suitable, note that some of the tricks commonly used for Gaussian variates, such as Box-Muller, do not work for a truncated distribution. But we can always fall back on a general principle: the Inverse Transform Sampling theorem.

    So we generate cumulative probabilities for our variates by transforming uniform variates, and we trust SciPy, using the inverse of the erf error function, to go back from probabilities to x-space values.

    This gives something like the following Python code (without any attempt at optimization):

    import math
    import numpy as np
    import scipy.special as spfn
    
    # using the "Inverse Method":
    def getMySamples(ranCount, mu, sigma, tau):
        nu0 = (math.log(tau) - mu) / sigma      # position of tau on the standard Gaussian curve
        headCP = 0.5 * (1 + spfn.erf(nu0 / math.sqrt(2)))
        tailCP = 1.0 - headCP                   # probability of being in the "tail"
    
        uvs = np.random.uniform(0.0, 1.0, ranCount)  # uniform variates
        cps = headCP + uvs * tailCP                  # cumulative probabilities
        nus = math.sqrt(2) * spfn.erfinv(2*cps - 1)  # positions on the standard Gaussian
        xs  = np.exp(mu + sigma * nus)               # go back to x space
        return xs
    

    Alternatives:

    We can leverage the significant amount of material related to the Truncated Gaussian distribution.

    There is a relatively recent (2016) review paper on the subject by Zdravko Botev and Pierre L'Ecuyer, which provides a pointer to publicly available R source code. Some material is much older, for example the 1986 book by Luc Devroye: Non-Uniform Random Variate Generation.

    For example, a possible rejection-based method: if τ (tau) maps to ν0 on the standard Gaussian curve, the standard Gaussian density goes like exp(-ν²/2). If we write ν = ν0 + δ, this is proportional (in δ) to: exp(-δ²/2) * exp(-ν0*δ).

    The idea is to approximate the exact distribution beyond ν0 by an exponential one with rate parameter ν0. Note that the exact density always lies below the approximating one, so we can accept the relatively cheap exponential variates with probability exp(-δ²/2).

    We can just pick an equivalent algorithm in the literature. In the Devroye book, chapter IX page 382, there is some pseudo-code:

    REPEAT generate independent exponential random variates X and Y UNTIL X² <= 2*ν0²*Y

    RETURN R <-- ν0 + X/ν0

    for which a NumPy rendition could be written like this:

    import math
    import numpy as np
    import numpy.random as nprd
    
    def getMySamplesXpRj(rawRanCount, mu, sigma, tau):
        nu0 = (math.log(tau) - mu) / sigma      # position of tau on the standard Gaussian
        if nu0 <= 0:
            raise ValueError("tau too small in getMySamplesXpRj")
        rnu0 = 1.0 / nu0
    
        xs = nprd.exponential(1.0, rawRanCount)     # exponential "raw" variates
        ys = nprd.exponential(1.0, rawRanCount)
    
        allSamples = nu0 + (rnu0 * xs)
        boolArray  = (xs*xs - 2*nu0*nu0*ys) <= 0.0  # acceptance test: X² <= 2*ν0²*Y
        samples    = allSamples[boolArray]          # accepted standard-Gaussian tail variates
    
        return np.exp(mu + sigma * samples)         # go back to x space
    

    According to Table 3 in the Botev-L'Ecuyer paper, the rejection rate of this algorithm is nicely low.
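
    That claim is easy to check empirically. The following quick estimate of the acceptance probability of the X² <= 2*ν0²*Y test is my own sketch, not code from the paper:

```python
import numpy as np

def acceptanceRate(nu0, trials=100_000, rng=None):
    """Estimate the fraction of exponential proposals accepted for a given nu0."""
    rng = np.random.default_rng(42) if rng is None else rng
    xs = rng.exponential(1.0, trials)
    ys = rng.exponential(1.0, trials)
    return np.mean(xs * xs <= 2.0 * nu0 * nu0 * ys)
```

    The acceptance rate improves as the truncation point ν0 moves further into the tail, which is what makes this method attractive for extreme thresholds.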

    Besides, if you are willing to allow for some sophistication, there is also some literature about the Ziggurat algorithm as used for truncated Gaussian distributions, for example the 2012 arXiv 1201.6140 paper by Nicolas Chopin at ENSAE-CREST.

    Side note: with Python 3, you can use Greek letters directly in variable names, σ instead of sigma, τ instead of tau, just as in the statistics books:

    $ python3
    Python 3.9.6 (default, Jun 29 2021, 00:00:00)
    >>> 
    >>> σ = 2
    >>> τ = 7
    >>> 
    >>> στ = σ * τ
    >>> 
    >>> στ + 1
    15
    >>>