I am learning PyMC3 for Bayesian modeling. You can create a model and sample with:
import pandas as pd
import pymc3 as pm
# obs is a DataFrame with a single column, containing
# the observed values for variable height
obs = pd.DataFrame(...)
# we create a pymc3 model
with pm.Model() as m:
mu = pm.Normal('mu', mu=178, sd=20)
sigma = pm.Uniform('sigma', lower=0, upper=50)
height = pm.Normal('height', mu=mu, sd=sigma, observed=obs)
trace = pm.sample(1000, tune=1000)
pm.traceplot(trace)
When I check the trace
(in this case 1000 samples from the posterior probability), I notice that 2 chains are created:
>>> trace.nchains
2
I read the tutorial on PyMC3 and looked through the API but it is unclear to me what a chain represents (in this case I asked for 1000 samples from the posterior but I got 2 chains, each one with 1000 samples from the posterior).
Are the chains different runs of the sampler with the same parameters or do they have some other meaning/purpose?
A chain is a single run of MCMC. So if you have six 2-d parameters in your model and ask for 1000 samples, you will get six 2x1000 arrays for each chain.
When running MCMC, it is a best practice to use multiple chains, as they can help diagnose problems. For example, the Gelman-Rubin diagnostic requires multiple chains, and runs automatically (using joblib
, which tries to use multiple cores if possible) if you use more than 1 chain in PyMC3
.
As a concrete example of when you might want multiple chains, consider sampling from a multimodal distribution. Even the NUTS
sampler may not visit both modes in a single chain, but you could diagnose this using multiple chains.
Note that PyMC3
usually combines the chains when you work with them (e.g., using trace.get_values('my_var')
), since they are all valid MCMC samples. This does lead to some confusing behavior in that asking for 1000 samples actually gets you 4000 on most systems, where you get 4 chains by default.