I'm struggling to understand how observed data works in pymc3. From the information I've found so far, these two examples have been the most helpful for getting me as far as I have, but I can't get my model to work.
As an example of what I'm trying to do, say I have records from customers at a restaurant, recording the temperature of the day on a categorical rating scale out of 5, and whether or not they ordered a main meal, a side, or a beverage. I've set up some mock data like so:
import numpy as np
import pymc3 as pm
from theano import shared
no_of_root_categories = 5
no_of_samples = 1000
hot_day = np.random.randint(no_of_root_categories, size=no_of_samples)
option_labels = ["Main", "Side", "Beverage"]
meal_options = np.random.randint(2, size=(len(option_labels), no_of_samples))
I want to model it as a simple Bayesian network, like so:
where the shaded nodes are observed.
This is what I've got:
with pm.Model() as crashing_model:
shape = (no_of_samples, no_of_root_categories)
alpha = (1 / no_of_root_categories) * np.ones(shape)
root_prior = pm.Dirichlet("Temperature Rating Prior", alpha, shape=shape)
root = pm.Categorical('Temperature Rating', p=root_p, shape=no_of_samples, observed=hot_day)
for item, label in enumerate(option_labels):
node_data = meal_options[item, :]
theano_probs = shared(np.array(node_probs))
node_prior = pm.Beta(f"{label} Prior",
mu=root,
sigma=root,
shape=no_of_samples,
testval=np.random.randint(1, size=no_of_samples))
pm.Binomial(label, p=node_prior, n=no_of_samples, observed=node_data)
which works, but when I try
with crashing_model:
trace = pm.sample(1000, random_seed=0)
Python exits with a 'Bad initial energy' error.
I can create a model which seems to work without the latent variables
with pm.Model() as working_model: # seems to work
root_values = [np.where(hot_day == i)[0].tolist() for i in range(no_of_root_categories)]
root_p = [len(i) / 1000 for i in root_values]
root = pm.Categorical('Temperature Rating', p=root_p)
shared_proportions = shared(np.array([len(hot_day[i]) for i in root_values]))
for item, label in enumerate(option_labels):
node_probs = [sum([meal_options[item, idx] for idx in category]) / len(category) for category in root_values]
theano_probs = shared(np.array(node_probs))
pm.Binomial(label, p=theano_probs[root], n=shared_proportions[root])
but I'm not sure how to translate what I've done there to work with the latent variables. Any help would be appreciated.
With some help from the pymc3 discourse I got it working. I'd somehow gotten the idea that I needed to make sure all the shapes were the same, but I didn't need to specify them at all. Code below for anyone else who has trouble.
with pm.Model() as model:
shape = (no_of_samples, no_of_root_categories)
alpha = np.ones(no_of_root_categories)
root_prior = pm.Dirichlet("Temperature Rating Prior", alpha)
root = pm.Categorical('Temperature Rating', p=root_p, observed=hot_day)
for item, label in enumerate(option_labels):
node_data = meal_options[item, :]
theano_alpha_probs = shared(np.ones(no_of_root_categories))
theano_beta_probs = shared(np.ones(no_of_root_categories))
node_prior = pm.Beta(f"{label} Prior",
alpha=theano_alpha_probs[root],
beta=theano_beta_probs[root],
testval=0.5)
pm.Bernoulli(label, p=node_prior, observed=node_data)