python random boxplot normal-distribution

Generate random values from boxplot


Let's say I have an existing box plot:

median: 5
q1: 2
q3: 6
5th percentile: 1
95th percentile: 2

I would like to generate 1,000,000 random values following this distribution.

Is there a way to do that?

I can generate skewed normal distributions, so another way would be to convert the boxplot values into the parameters of a skewed distribution. But given that the density changes whenever alpha changes, I have no idea how to go about that.


Solution

  • The most general way to generate a random variate following a distribution is as follows:

    • Generate a uniform random variate bounded by 0 and 1 (e.g., random.random()).
    • Take the inverse CDF (inverse cumulative distribution function) of that number.

    The result is a number that follows the distribution.
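As a minimal sketch of these two steps, here is inverse transform sampling for a distribution whose inverse CDF is known in closed form. The exponential distribution, the function name `sample_exponential`, and the rate and seed values are assumptions chosen only for illustration; they are not part of the question.

```python
import numpy as np

def sample_exponential(rate, size, rng):
    """Inverse transform sampling for an exponential distribution."""
    # Step 1: uniform random variates in [0, 1)
    u = rng.random(size)
    # Step 2: apply the inverse CDF, F^-1(u) = -ln(1 - u) / rate
    return -np.log1p(-u) / rate

rng = np.random.default_rng(12345)  # arbitrary seed for reproducibility
samples = sample_exponential(rate=2.0, size=100_000, rng=rng)
```

The sample mean should come out near 1/rate, which is a quick way to check that the inverse CDF was applied correctly.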

    In your case, you already have a good idea of how the inverse CDF (ICDF(x)) looks, since it's determined already by several of your parameters as follows:

    • ICDF(0.05) = 5th percentile
    • ICDF(0.25) = 1st quartile
    • ICDF(0.5) = median
    • ICDF(0.75) = 3rd quartile
    • ICDF(0.95) = 95th percentile

    However, you haven't determined the minimum and maximum values, which would correspond to ICDF(0) and ICDF(1), respectively; thus you would have to estimate them. You can then fill in the missing points of the inverse CDF by interpolation. The simplest example is linear interpolation, but other more complicated examples include fitting a curve or spline to the inverse CDF's points, such as a Catmull–Rom spline. Note, however, that properly speaking, the inverse CDF must be monotonically nondecreasing.
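If you want something smoother than linear interpolation while keeping the inverse CDF monotone, one option is a shape-preserving cubic interpolant. The sketch below uses SciPy's `PchipInterpolator` (a different spline from the Catmull–Rom spline mentioned above, swapped in because it does not overshoot monotone data). All numbers are hypothetical, and the minimum and maximum in particular are guesses that you would have to supply yourself.

```python
import numpy as np
from scipy.interpolate import PchipInterpolator

# Hypothetical box-plot summary, chosen only for illustration.
# The first and last entries are the guessed minimum and maximum.
probs = np.array([0.0, 0.05, 0.25, 0.5, 0.75, 0.95, 1.0])
quantiles = np.array([0.0, 1.0, 2.0, 5.0, 6.0, 8.0, 10.0])

# PCHIP preserves monotonicity of the input points, so the
# interpolated inverse CDF is nondecreasing as well.
icdf = PchipInterpolator(probs, quantiles)

rng = np.random.default_rng(0)  # arbitrary seed
values = icdf(rng.random(1000))
```

Because the interpolant passes through the given points, `icdf(0.5)` returns the median exactly, and every generated value stays between the guessed minimum and maximum.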

    On the other hand, if you have access to the underlying data points, rather than just a box plot, there are other methods you can use. Examples include kernel density estimation, histograms, or regression models (particularly for time series data). See also "Generate random data based on existing data".
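For the case where the raw data is available, a kernel density estimate can be sampled directly. This is only a sketch: the `data` array below is synthetic stand-in data (drawn from a gamma distribution purely so the example runs), and the default bandwidth of `scipy.stats.gaussian_kde` may not suit your data.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(1)  # arbitrary seed
# Stand-in for your real observations (hypothetical data).
data = rng.gamma(shape=2.0, scale=1.5, size=500)

# Fit a Gaussian KDE to the observations and draw new values from it.
kde = gaussian_kde(data)
# resample returns an array of shape (dimensions, size); take the first row.
new_values = kde.resample(size=1000, seed=42)[0]
```

Note that a Gaussian KDE can produce values slightly outside the range of the original data, which may or may not be acceptable.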


    The following examples show this approach with linear interpolation:

    import numpy
    import scipy.interpolate as intrp
    # Placeholder values (assumed for illustration); replace with
    # your own minimum, percentiles, and maximum
    mn, p5, q1, p50, q3, p95, mx = 0, 1, 2, 5, 6, 8, 10
    # Generate 100 random values based on 5 percentiles,
    # minimum, and maximum (7 probabilities for 7 values)
    interp=intrp.interp1d([0,0.05,0.25,0.5,0.75,0.95,1],
      [mn,p5,q1,p50,q3,p95,mx])
    values=interp(numpy.random.random(size=100))
    # Generate 100 random values based on 5 percentiles,
    # extrapolating at ends
    interp=intrp.interp1d([0.05,0.25,0.5,0.75,0.95],
      [p5,q1,p50,q3,p95],fill_value="extrapolate")
    values=interp(numpy.random.random(size=100))
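To check the result at the scale asked for in the question, you can generate 1,000,000 values and recompute the percentiles from the sample. This sketch uses `numpy.interp`, which performs the same linear interpolation without SciPy. The quantile values, including the guessed minimum and maximum, are hypothetical.

```python
import numpy as np

# Hypothetical values; the first and last are guessed min and max.
probs = [0.0, 0.05, 0.25, 0.5, 0.75, 0.95, 1.0]
quantiles = [0.0, 1.0, 2.0, 5.0, 6.0, 8.0, 10.0]

rng = np.random.default_rng(2024)  # arbitrary seed
# Linear interpolation of the inverse CDF at uniform random points.
values = np.interp(rng.random(1_000_000), probs, quantiles)

# Recompute the five percentiles from the generated sample.
check = np.percentile(values, [5, 25, 50, 75, 95])
```

The entries of `check` should land close to the five percentiles you started from. If they do not, the probabilities and values passed to the interpolation are probably misaligned.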