Search code examples
pythonnumpyscipyprobability-density

Creating a mixture of probability distributions for sampling


Is there a general way to join SciPy (or NumPy) probability distributions to create a mixture probability distribution which can then be sampled from?

I have such a distribution for display using something like:

mixture_gaussian = (norm.pdf(x_axis, -3, 1) + norm.pdf(x_axis, 3, 1)) / 2

which if then plotted looks like:

double gaussian

However, I can't sample from this generated model, as it's just a list of points which will plot as the curve.

Note, this specific distribution is just a simple example. I'd like to be able to generate several kinds of distributions (including "sub"-distributions which are not just normal distributions). Ideally, I would hope there would be someway for the function to be automatically normalized (i.e. not having to do the / 2 explicitly as in the code above.

Does SciPy/NumPy provide some way of easily accomplishing this?

This answer provides a way that such a sampling from a multiple distributions could be done, but it certainly requires a bit of handcrafting for a given mixture distribution, especially when wanting to weight different "sub"-distributions differently. This is usable, but I would hope for method that's a bit cleaner and straight forward if possible. Thanks!


Solution

  • Sampling from a mixture of distributions (where PDFs are added with some coefficients c_1, c_2, ... c_n) is equivalent to sampling each independently, and then, for each index, picking the value from k-th sample, with probability c_k.

    The latter, mixing, step can be efficiently done with numpy.random.choice. Here is an example where three distributions are mixed. The distributions are listed in distributions, and their coefficients in coefficients. There is a fat normal distribution, a uniform distribution, and a narrow normal distribution, with coefficients 0.5, 0.2, 0.3. The mixing happens at data[np.arange(sample_size), random_idx] after random_idx are generated according to given coefficients.

    import numpy as np
    import matplotlib.pyplot as plt
    
    distributions = [
        {"type": np.random.normal, "kwargs": {"loc": -3, "scale": 2}},
        {"type": np.random.uniform, "kwargs": {"low": 4, "high": 6}},
        {"type": np.random.normal, "kwargs": {"loc": 2, "scale": 1}},
    ]
    coefficients = np.array([0.5, 0.2, 0.3])
    coefficients /= coefficients.sum()      # in case these did not add up to 1
    sample_size = 100000
    
    num_distr = len(distributions)
    data = np.zeros((sample_size, num_distr))
    for idx, distr in enumerate(distributions):
        data[:, idx] = distr["type"](size=(sample_size,), **distr["kwargs"])
    random_idx = np.random.choice(np.arange(num_distr), size=(sample_size,), p=coefficients)
    sample = data[np.arange(sample_size), random_idx]
    plt.hist(sample, bins=100, density=True)
    plt.show()
    

    histogram