Tags: python, random, scipy, simulation, normal-distribution

Scipy: sample lognormal distribution from tail only


I'm building a simulation which requires random draws from the tail of a lognormal distribution. A threshold τ (tau) is chosen, and the resulting conditional distribution is given by:

    F_τ(x) = (F(x) - F(τ)) / (1 - F(τ)),   for x ≥ τ

I need to randomly sample from that conditional distribution, where F(x) is lognormal with a chosen µ (mu) and σ (sigma), and τ (tau) is set by the user.

My inelegant solution right now is simply to sample from the lognormal, tossing out any values under τ (tau), until I have the sample size I need. But I'm sure this can be improved.
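
For concreteness, that filter-and-retry approach might look like the following sketch (the function name and the `rng` handling are my own choices, not from any library):

```python
import numpy as np

def naiveTailSamples(ranCount, mu, sigma, tau, rng=None):
    """Naive rejection: draw lognormal variates, keep only those >= tau."""
    rng = np.random.default_rng() if rng is None else rng
    samples = np.empty(0)
    while samples.size < ranCount:
        draws = rng.lognormal(mu, sigma, size=ranCount)           # fresh batch of lognormal draws
        samples = np.concatenate([samples, draws[draws >= tau]])  # keep tail values only
    return samples[:ranCount]
```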

Thanks for the help!


Solution

  • The easiest way is probably to leverage the truncated normal distribution as provided by SciPy.

    This gives the following code, with ν (nu) as the variable of the standard Gaussian distribution, and τ (tau) mapping to ν0 on that distribution. The function returns a NumPy array containing ranCount lognormal variates:

    import math
    import numpy as np
    from scipy.stats import truncnorm
    
    def getMySamplesScipy(ranCount, mu, sigma, tau):
        nu0 = (math.log(tau) - mu) / sigma              # position of tau on the standard Gaussian
        xs = truncnorm.rvs(nu0, np.inf, size=ranCount)  # truncated standard-normal samples
        ys = np.exp(mu + sigma * xs)                    # go back to x space
        return ys
    

    If for some reason this is not suitable, note that some of the tricks commonly used for Gaussian variates, such as Box-Muller, do not work for a truncated distribution. But we can always fall back on a general principle: the Inverse Transform Sampling theorem.

    So we generate cumulative probabilities for our variates by transforming uniform variates, and we trust SciPy, using the inverse of the erf error function, to go back from probabilities to x-space values.

    This gives something like the following Python code (without any attempt at optimization):

    import math
    import numpy as np
    import scipy.special as spfn
    
    # using the "Inverse Method":
    def getMySamples(ranCount, mu, sigma, tau):
        nu0 = (math.log(tau) - mu) / sigma      # position of tau on the standard Gaussian curve
        headCP = 0.5 * (1 + spfn.erf(nu0 / math.sqrt(2)))
        tailCP = 1.0 - headCP                   # probability of being in the "tail"
    
        uvs = np.random.uniform(0.0, 1.0, ranCount)  # uniform variates
        cps = headCP + uvs * tailCP                  # cumulative probabilities
        nus = math.sqrt(2) * spfn.erfinv(2*cps - 1)  # positions on the standard Gaussian
        xs  = np.exp(mu + sigma * nus)               # go back to x space
        return xs
    

    Alternatives:

    We can leverage the significant amount of material related to the Truncated Gaussian distribution.

    There is a relatively recent (2016) review paper on the subject by Zdravko Botev and Pierre L'Ecuyer, which provides a pointer to publicly available R source code. Some material is much older, for example the 1986 book by Luc Devroye: Non-Uniform Random Variate Generation.

    For example, a possible rejection-based method: if τ (tau) maps to ν0 on the standard Gaussian curve, the standard Gaussian density goes like exp(-ν²/2). If we write ν = ν0 + δ, this is proportional (in δ) to: exp(-δ²/2) * exp(-ν0*δ).

    The idea is to approximate the exact distribution beyond ν0 by an exponential one with rate parameter ν0. Note that the exact density always lies below the approximating one, so we can accept the relatively cheap exponential variates with probability exp(-δ²/2).

    We can just pick an equivalent algorithm in the literature. In the Devroye book, chapter IX page 382, there is some pseudo-code:

    REPEAT generate independent exponential random variates X and Y UNTIL X² <= 2*ν0²*Y

    RETURN R <-- ν0 + X/ν0

    for which a NumPy rendition could be written like this:

    import math
    import numpy as np
    import numpy.random as nprd
    
    def getMySamplesXpRj(rawRanCount, mu, sigma, tau):
        nu0 = (math.log(tau) - mu) / sigma      # position of tau on the standard Gaussian
        if nu0 <= 0:
            raise ValueError("tau too small in getMySamplesXpRj")
        rnu0 = 1.0 / nu0
    
        xs = nprd.exponential(1.0, rawRanCount)     # exponential "raw" variates
        ys = nprd.exponential(1.0, rawRanCount)
    
        allSamples = nu0 + (rnu0 * xs)
        boolArray  = (xs*xs - 2*nu0*nu0*ys) <= 0.0  # acceptance test: X² <= 2*ν0²*Y
        samples    = allSamples[boolArray]          # accepted standard-Gaussian tail variates
    
        return np.exp(mu + sigma * samples)         # go back to x space
    

    According to Table 3 in the Botev-L'Ecuyer paper, the rejection rate of this algorithm is nicely low.
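
    That claim is easy to check empirically. The following quick estimate of the acceptance probability of the X² <= 2*ν0²*Y test is my own sketch, not code from the paper:

```python
import numpy as np

def acceptanceRate(nu0, trials=100_000, rng=None):
    """Estimate the fraction of exponential proposals accepted for a given nu0."""
    rng = np.random.default_rng(42) if rng is None else rng
    xs = rng.exponential(1.0, trials)
    ys = rng.exponential(1.0, trials)
    return np.mean(xs * xs <= 2.0 * nu0 * nu0 * ys)
```

    The acceptance rate improves as the truncation point ν0 moves further into the tail, which is what makes this method attractive for extreme thresholds.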

    Besides, if you are willing to allow for some sophistication, there is also some literature about the Ziggurat algorithm as used for truncated Gaussian distributions, for example the 2012 arXiv 1201.6140 paper by Nicolas Chopin at ENSAE-CREST.

    Side note: with Python 3, you can use Greek letters directly in variable names, σ instead of sigma, τ instead of tau, just as in the statistics books:

    $ python3
    Python 3.9.6 (default, Jun 29 2021, 00:00:00)
    >>> 
    >>> σ = 2
    >>> τ = 7
    >>> 
    >>> στ = σ * τ
    >>> 
    >>> στ + 1
    15
    >>>