Tags: python, numpy, scipy, distribution, montecarlo

Is there a fast alternative to scipy _norm_pdf for correlated distribution sampling?


I have fit a series of SciPy continuous distributions for a Monte-Carlo simulation and am looking to take a large number of samples from these distributions. However, I would like to take correlated samples, such that the ith sample from each distribution corresponds to the same percentile, e.g., the 90th.

In doing this, I've found a quirk in SciPy performance:

import numpy as np
import scipy as sp
import scipy.stats  # ensure the stats submodule is loaded for sp.stats

# distro_props is a list of (shape, loc, scale) tuples from the earlier fits;
# n is the number of samples to draw

# very fast way to take many uncorrelated samples of length n
for shape, loc, scale in distro_props:
    sp.stats.norm.rvs(*shape, loc=loc, scale=scale, size=n)

# very, very slow way to take correlated samples of length n
correlate = np.random.uniform(size=n)
for shape, loc, scale in distro_props:
    sp.stats.norm.ppf(correlate, *shape, loc=loc, scale=scale)

Most of the results about this claim that the slowness of these SciPy distributions comes from their type-checking and argument-handling wrappers. However, when I profiled the code, the vast bulk of the time was spent in the underlying math function (_continuous_distns.py:179(_norm_pdf)). Furthermore, the time scales with n, implying that it's looping over every element internally.
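
For reference, a profile like that can be reproduced with cProfile; a minimal sketch (the distribution family and shape parameter below are illustrative, not my actual fitted values):

import cProfile
import numpy as np
import scipy.stats

correlate = np.random.uniform(size=100_000)
# sort by cumulative time to surface where the ppf call actually spends its time
cProfile.run("scipy.stats.skewnorm.ppf(correlate, 4.0)", sort="cumtime")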

The SciPy docs on rv_continuous almost seem to suggest that subclasses should override the ppf for performance, but it seems bizarre that I would monkeypatch SciPy to speed up its own ppf. I could just compute the normal's ppf from its closed-form formula, but I also use the lognormal and skew normal, which are more of a pain to implement.
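
For reference, the normal and lognormal ppfs do have closed forms via scipy.special.ndtri; a minimal sketch of what that direct computation could look like (the helper names are mine, and the lognormal version assumes SciPy's s/loc/scale parameterization):

import numpy as np
from scipy.special import ndtri  # closed-form inverse of the standard normal CDF

def fast_norm_ppf(q, loc=0.0, scale=1.0):
    return loc + scale * ndtri(q)

def fast_lognorm_ppf(q, s, loc=0.0, scale=1.0):
    # matches scipy.stats.lognorm: ppf(q) = loc + scale * exp(s * ndtri(q))
    return loc + scale * np.exp(s * ndtri(q))

correlate = np.random.uniform(size=1_000_000)
samples = fast_norm_ppf(correlate, loc=5.0, scale=2.0)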

So, what is the best way in Python to compute a fast ppf for the normal, lognormal, and skew-normal distributions? Or, more broadly, to take correlated samples from several such distributions?


Solution

  • Ultimately, this issue was caused by my use of the skew-normal distribution. The ppf of the skew normal has no closed-form analytic definition, so computing it falls back to rv_continuous's generic numeric approximation, which iteratively computes the cdf to zero in on the ppf value. Since the skew-normal pdf is the product of the normal pdf and the normal cdf, this numeric approximation calls the normal's pdf and cdf many, many times. This is why, when I profiled the code, it looked like the normal distribution was the problem rather than the skew normal. The other answer to this question achieved time savings by skipping type-checking, but that only improved small-n runtimes and didn't change how the runtime grows with n. A quick way to see the effect in isolation is to time the closed-form normal ppf against the skew-normal ppf on the same quantile array, as in the sketch below (the array size and shape parameter are illustrative):
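
    import time
    import numpy as np
    from scipy import stats

    q = np.random.uniform(size=100_000)

    t0 = time.perf_counter()
    stats.norm.ppf(q)           # closed-form inverse, vectorized
    t1 = time.perf_counter()
    stats.skewnorm.ppf(q, 4.0)  # generic numeric inversion, element by element
    t2 = time.perf_counter()

    print(f"norm.ppf:     {t1 - t0:.3f}s")
    print(f"skewnorm.ppf: {t2 - t1:.3f}s")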

    To solve this problem, I replaced the skew-normal distribution with the Johnson SU distribution. It has two more free parameters than a normal distribution, so it can fit different kinds of skew and kurtosis effectively. Its support is all real numbers, and it has a closed-form ppf with a fast implementation in SciPy. Below you can see example Johnson SU distributions I've been fitting from the 10th, 50th, and 90th percentiles.

    [figure: example fitted Johnson SU distributions]
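
    A rough sketch of one way to set up such a fit (the percentile values here are placeholders; with four free parameters and three target percentiles the system is underdetermined, so least squares is used, and this is not necessarily the exact procedure):

    import numpy as np
    from scipy import stats, optimize

    def fit_johnsonsu(p10, p50, p90):
        """Fit Johnson SU parameters so its ppf hits three target percentiles."""
        quantiles = np.array([0.1, 0.5, 0.9])
        targets = np.array([p10, p50, p90])

        def residuals(params):
            a, b, loc, scale = params
            return stats.johnsonsu.ppf(quantiles, a, b, loc=loc, scale=scale) - targets

        # b and scale must stay positive
        fit = optimize.least_squares(
            residuals,
            x0=[0.0, 1.0, p50, (p90 - p10) / 2.0],
            bounds=([-np.inf, 1e-6, -np.inf, 1e-6], np.inf),
        )
        return fit.x

    a, b, loc, scale = fit_johnsonsu(2.0, 5.0, 11.0)
    # correlated sampling then goes through the fast closed-form ppf
    correlate = np.random.uniform(size=1_000_000)
    samples = stats.johnsonsu.ppf(correlate, a, b, loc=loc, scale=scale)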