Search code examples
pythonbayesianpymc3

Updating model on PyMC3 with new observed data


I have measured the diameter of 80 fruits last year, and after checking what is the best distribution of the values, I've created a PyMC3 model

with Model() as diam_model:
    mu = Normal('mu',mu=57,sd=5.42)
    sigma = Uniform('sigma',0,10)

after, as far as I understand, I've "trained" the model with my prior data (the 80 values)

with diam_model:
    dist = Normal('dist',mu=mu,sd=sigma, observed=prior_data.values)

with diam_model:
    samples=fit().sample(1000)

then I used the plot_posteriorof the samples, returning also the mean and HPD.

My idea is to measure again this year using Bayesian update to reduce the sample size. How can I add single values, and update the posterior, expecting that the HPD gets smaller and smaller?


Solution

  • Kernel Density Estimated Updated Priors

    Using the other answer suggested as a duplicate, one could extract approximate versions of the priors using the code from this Jupyter notebook.

    First round

    I'll assume we have data from the first round of sampling, which we can impose the mean 57.0 and standard deviation of 5.42.

    import numpy as np
    import pymc3 as pm
    from sklearn.preprocessing import scale
    from scipy import stats
    
    # generate data forced to match distribution indicated
    Y0 = 57.0 + scale(np.random.normal(size=80))*5.42
    
    with pm.Model() as m0:
        # let's place an informed, but broad prior on the mean
        mu = pm.Normal('mu', mu=50, sd=10)
        sigma = pm.Uniform('sigma', 0, 10)
        
        y = pm.Normal('y', mu=mu, sd=sigma, observed=Y0)
        
        trace0 = pm.sample(5000, tune=5000)
    

    Extracting new priors from posterior

    We can then use the results of this model to extract KDE posteriors on the parameters with the following code from the referenced notebook:

    def from_posterior(param, samples, k=100):
        smin, smax = np.min(samples), np.max(samples)
        width = smax - smin
        x = np.linspace(smin, smax, k)
        y = stats.gaussian_kde(samples)(x)
        
        # what was never sampled should have a small probability but not 0,
        # so we'll extend the domain and use linear approximation of density on it
        x = np.concatenate([[x[0] - 3 * width], x, [x[-1] + 3 * width]])
        y = np.concatenate([[0], y, [0]])
        return pm.Interpolated(param, x, y)
    

    Second round

    Now, if we have more data we can run a new model using the KDE updated priors:

    Y1 = np.random.normal(loc=57, scale=5.42, size=100)
    
    with pm.Model() as m1:
        mu = from_posterior('mu', trace0['mu'])
        sigma = from_posterior('sigma', trace0['sigma'])
        
        y = pm.Normal('y', mu=mu, sd=sigma, observed=Y1)
        
        trace1 = pm.sample(5000, tune=5000)
    

    And similarly, one could use this trace to extract updated posterior estimates for future rounds of incoming data.

    Caveat Emptor

    The above methodology yields approximations to true updated priors and would be most useful in cases where conjugate priors are not possible. It should also be noted that I'm not sure the extent to which such KDE-base approximations introduce errors and how they propagate in a model when used repeatedly. It's a neat trick but one should be cautious about putting this into production without further validation of its robustness.

    In particular, I would be very concerned about situations where the posterior distributions have strong correlation structures. The code provided here generates a "prior" distribution using only the marginals of each latent variable. This seems fine for simple models like this, and admittedly the initial priors also lack correlations, so it may not be a huge issue here. Generally, however, summarizing to marginals involves discarding information about how variables are related, and in other contexts this could be rather significant. For example, the default parameterization of a Beta distribution always leads to correlated parameters in the posterior and thus the above technique would inappropriate. Instead, one would need to infer a multi-dimensional KDE that incorporates all the latent variables.


    Conjugate Model

    However, in your particular case the expected distribution is Gaussian and those distributions have established closed-form conjugate models, i.e., principled solutions rather than approximations. I strongly recommend working through Kevin Murphy's Conjugate Bayesian analysis of the Gaussian distribution.

    Normal-Inverse Gamma Model

    The Normal-Inverse Gamma model estimates both the mean and the variance of an observed normal random variable. The mean is modeled with a normal prior; the variance with an inverse gamma. This model uses four parameters for the prior:

    mu_0  = prior mean
    nu    = number of observations used to estimate the mean
    alpha = half the number of obs used to estimate variance
    beta  = half the sum of squared deviations
    

    Given your initial model, we could use the values

    mu_0  = 57.0
    nu    = 80
    alpha = 40
    beta  = alpha*5.42**2
    

    You can then plot the log-likelihood of the prior as follows:

    # points to compute likelihood at
    mu_grid, sd_grid = np.meshgrid(np.linspace(47, 67, 101), 
                                   np.linspace(4, 8, 101))
    
    # normal ~ N(X | mu_0, sigma/sqrt(nu))
    logN = stats.norm.logpdf(x=mu_grid, loc=mu_0, scale=sd_grid/np.sqrt(nu))
    
    # inv-gamma ~ IG(sigma^2 | alpha, beta)
    logIG = stats.invgamma.logpdf(x=sd_grid**2, a=alpha, scale=beta)
    
    # full log-likelihood
    logNIG = logN + logIG
    
    # actually, we'll plot the -log(-log(likelihood)) to get nicer contour
    plt.figure(figsize=(8,8))
    plt.contourf(mu_grid, sd_grid, -np.log(-logNIG))
    plt.xlabel("$\mu$")
    plt.ylabel("$\sigma$")
    plt.show()
    

    enter image description here

    Updating parameters

    Given new data, Y1, one updates the parameters as follows:

    # precompute some helpful values
    n = Y1.shape[0]
    mu_y = Y1.mean()
    
    # updated NIG parameters
    mu_n = (nu*mu_0 + n*mu_y)/(nu + n)
    nu_n = nu + n
    alpha_n = alpha + n/2
    beta_n = beta + 0.5*np.square(Y1 - mu_y).sum() + 0.5*(n*nu/nu_n)*(mu_y - mu_0)**2
    

    For the sake of illustrating change in the model, let's generate some data from a slightly different distribution and then plot the resulting posterior log-likelihood:

    np.random.seed(53211277)
    Y1 = np.random.normal(loc=62, scale=7.0, size=20)
    

    which yields

    enter image description here

    Here, the 20 observations are not enough to completely move to the new location and scale I provided, but both parameters appear shifted in that direction.