I am new to PyMC and trying to set up the simple conditional probability model: P(has_diabetes|bmi, race)
. Race can take on 5 discrete values encoded as 0-4 and BMI can take on a non-zero positive real number. So far I have the parent variables setup like this:
p_race = [0.009149232914923292,
0.15656903765690378,
0.019637377963737795,
0.013947001394700141,
0.800697350069735]
race = pymc.Categorical('race', p_race)
bmi_alpha = pymc.Exponential('bmi_alpha', 1)
bmi_beta = pymc.Exponential('bmi_beta', 1)
bmi = pymc.Gamma('bmi', bmi_alpha, bmi_beta, value=bmis, observed=True)
I have observed data that looks like:
| bmi | race | has_diabetes |
| 21.7 | 1 | 0 |
| 45.3 | 4 | 1 |
| 18.9 | 2 | 0 |
| 26.6 | 0 | 0 |
| 35.1 | 4 | 0 |
I am attempting to model has_diabetes
as:
has_diabetes = pymc.Bernoulli('has_diabetes', p_diabetes, value=data, observed=True)
My problem is that I am not sure how to construct the p_diabetes
function since it is dependent on the values of race
and and the continuous value of bmi
.
You need to construct a deterministic function that generates p_diabetes
as a function of your predictors. The safest way to do this is via a logit-linear transformation. For example:
intercept = pymc.Normal('intercept', 0, 0.01, value=0)
beta_race = pymc.Normal('beta_race', 0, 0.01, value=np.zeros(4))
beta_bmi = pymc.Normal('beta_bmi', 0, 0.01, value=0)
@pymc.deterministic
def p_diabetes(b0=intercept, b1=beta_race, b2=beta_bmi):
# Prepend a zero for baseline
b1 = np.append(0, b1)
# Logit-linear model
return pymc.invlogit(b0 + b1[race] + b2*bmi)
I would make the baseline race be the largest group (it is assumed to be index 0 in this example).
Actually, its not clear what the first part of the model above is for, specifically, why you are building models for the predictors, but perhaps I am missing something.