I'm doing some pymc3 and I would like to create custom Stochastics, however there doesn't seem to be a lot documentation about how it's done. I know how to use the as_op way, however apparently that makes it impossible to use the NUTS sampler, in which case I don't see the advantage of pymc3 over pymc.
The tutorial mentions that it can be done by inheriting from theano.Op. But can anyone show me how that would work (I'm still getting started on theano)? I have two Stochastics that I want to define.
The first one should be easier, it's an N dimension vector F
that has only constant parent variables:
with myModel:
F = DensityDist('F', lambda value: pymc.skew_normal_like(value, F_mu_array, F_std_array, F_a_array), shape = N)
I want a skew normal distribution, which doesn't seem to be implemented in pymc3 yet, I just imported the pymc2 version. Unfortunately, F_mu_array, F_std_array, F_a_array and F
are all N-dimensional vectors, and the lambda thing doesn't seem to work with an N-dimension list value
.
Firstly, is there a way to make the lambda input an N-dimensional array? If not, I guess I would need to define the Stochastic F
directly, and this is where I presume I need theano.Op to make it work.
The second example is a more complicated function of other Stochastics. Here how I want to define it (incorrectly at the moment):
with myModel:
ln2_var = Uniform('ln2_var', lower=-10, upper=4)
sigma = Deterministic('sigma', exp(0.5*ln2_var))
A = Uniform('A', lower=-10, upper=10, shape=5)
C = Uniform('C', lower=0.0, upper=2.0, shape=5)
sw = Normal('sw', mu=5.5, sd=0.5, shape=5)
# F from before
F = DensityDist('F', lambda value: skew_normal_like(value, F_mu_array, F_std_array, F_a_array), shape = N)
M = Normal('M', mu=M_obs_array, sd=M_stdev, shape=N)
# Radius forward-model (THIS IS THE STOCHASTIC IN QUESTION)
R = Normal('R', mu = R_forward(F, M, A, C, sw, N), sd=sigma, shape=N)
Where the function R_forward(F,M,A,C,sw,N)
is naively defined as:
from theano.tensor import lt, le, eq, gt, ge
def R_forward(Flux, Mass, A, C, sw, num):
for i in range(num):
if lt(Mass[i], 0.2):
if lt(Flux[i], sw[0]):
muR = C[0]
else:
muR = A[0]*log10(Flux[i]) + C[0] - A[0]*log10(sw[0])
elif (le(0.2, Mass[i]) or le(Mass[i], 0.5)):
if lt(Flux[i], sw[1]):
muR = C[1]
else:
muR = A[1]*log10(Flux[i]) + C[1] - A[1]*log10(sw[1])
elif (le(0.5, Mass[i]) or le(Mass[i], 1.5)):
if lt(Flux[i], sw[2]):
muR = C[2]
else:
muR = A[2]*log10(Flux[i]) + C[2] - A[2]*log10(sw[2])
elif (le(1.5, Mass[i]) or le(Mass[i], 3.5)):
if lt(Flux[i], sw[3]):
muR = C[3]
else:
muR = A[3]*log10(Flux[i]) + C[3] - A[3]*log10(sw[3])
else:
if lt(Flux[i], sw[4]):
muR = C[4]
else:
muR = A[4]*log10(Flux[i]) + C[4] - A[4]*log10(sw[4])
return muR
This presumably won't work of course. I can see how I would use as_op
, but I want to preserve the NUTS sampling.
I realize this is a bit late now, but I thought I'd answer the question (rather vaguely) anyways.
If you want to define a stochastic function (e.g. a probability distribution), then you need to do a couple of things:
First, define a subclass of either Discrete (pymc3.distributions.Discrete) or Continuous, which has at least the method logp, which returns the log-likelihood of your stochastic. If you define this as a simple symbolic equation (x+1), I believe you do not need to take care of any gradients (but don't quote me on this; see the documentation about this). I'll get on to more complicated cases below. In the unfortunate case that you need to do anything more complex, as in your second example (pymc3 now has a skew normal distribution implemented, by the way), you need to define the operations required for it (used in the logp method) as a Theano Op. If you need no derivatives, then the as_op does the job, but as you said, gradients are kind of the idea of pymc3.
This is where it gets complicated. If you want to use NUTS (or need gradients for whatever reason), then you need to implement your operation used in logp as a subclass of theano.gof.Op. Your new op class (let's call it just Op from now on) will need two or three methods at least. The first one defines inputs/outputs to the Op (check the Op documentation). The perform() method (or variants you might choose) is the one that does the operation you want (your R_forward function, for example). This can be done in pure python, if you so wish. The third method, grad(), is where you define the gradient of your perform()'s output wrt the inputs. The actual output to grad() is a bit different, but not a big deal.
And it is in grad() that using Theano pays off. If you define your entire perform() in Theano, then it might be that you can easily use automatic differentiation (theano.tensor.grad or theano.tensor.jacobian) to do the work for you (see the example below). However, this is not necessarily going to be easy.
In your second example, it would mean implementing your R_forward function in Theano, which could be complicated.
Here I include a somewhat minimal example of an Op that I created while learning to do these things.
def my_th_fun():
""" Some needed auxiliary functions.
"""
X = th.tensor.vector('X')
SCALE = th.tensor.scalar('SCALE')
X.tag.test_value = np.array([1,2,3,4])
SCALE.tag.test_value = 5.
Scale, upd_sm_X = th.scan(lambda x, scale: scale*(scale+ x),
sequences=[X],
outputs_info=[SCALE])
fun_Scale = th.function(inputs=[X, SCALE], outputs=Scale)
D_out_d_scale = th.tensor.grad(Scale[-1], SCALE)
fun_d_out_d_scale = th.function([X, SCALE], D_out_d_scale)
return Scale, fun_Scale, D_out_d_scale, fun_d_out_d_scale
class myOp(th.gof.Op):
""" Op subclass with a somewhat silly computation. It uses
th.scan and th.tensor.grad is used to calculate the gradient
automagically in the grad() method.
"""
__props__ = ()
itypes = [th.tensor.dscalar]
otypes = [th.tensor.dvector]
def __init__(self, *args, **kwargs):
super(myOp, self).__init__(*args, **kwargs)
self.base_dist = np.arange(1,5)
(self.UPD_scale, self.fun_scale,
self.D_out_d_scale, self.fun_d_out_d_scale)= my_th_fun()
def perform(self, node, inputs, outputs):
scale = inputs[0]
updated_scale = self.fun_scale(self.base_dist, scale)
out1 = self.base_dist[0:2].sum()
out2 = self.base_dist[2:4].sum()
maxout = np.max([out1, out2])
exp_out1 = np.exp(updated_scale[-1]*(out1-maxout))
exp_out2 = np.exp(updated_scale[-1]*(out2-maxout))
norm_const = exp_out1 + exp_out2
outputs[0][0] = np.array([exp_out1/norm_const, exp_out2/norm_const])
def grad(self, inputs, output_gradients): #working!
""" Calculates the gradient of the output of the Op wrt
to the input. As a simple example, the input is scalar.
Notice how the output is actually the gradient multiplied
by the output_gradients, which is an input provided by
theano when calculating gradients.
"""
scale = inputs[0]
X = th.tensor.as_tensor(self.base_dist)
# Do I need to recalculate all this or can I assume that perform() has
# always been called before grad() and thus can take it from there?
# In any case, this is a small enough example to recalculate quickly:
all_scale, _ = th.scan(lambda x, scale_1: scale_1*(scale_1+ x),
sequences=[X],
outputs_info=[scale])
updated_scale = all_scale[-1]
out1 = self.base_dist[0:1].sum()
out2 = self.base_dist[2:3].sum()
maxout = np.max([out1, out2])
exp_out1 = th.tensor.exp(updated_scale*(out1 - maxout))
exp_out2 = th.tensor.exp(updated_scale*(out2 - maxout))
norm_const = exp_out1 + exp_out2
d_S_d_scale = th.theano.grad(all_scale[-1], scale)
Jac1 = (-(out1-out2)*d_S_d_scale*
th.tensor.exp(updated_scale*(out1+out2 - 2*maxout))/(norm_const**2))
Jac2 = -Jac1
return Jac1*output_gradients[0][0]+ Jac2*output_gradients[0][1],
This Op can then be used inside the logp() method of a stochastic in pymc3:
import pymc3 as pm
class myDist(pm.distributions.Discrete):
def __init__(self, invT, *args, **kwargs):
super(myDist, self).__init__(*args, **kwargs)
self.invT = invT
self.myOp = myOp()
def logp(self, value):
return self.myOp(self.invT)[value]
I hope it helps any (hopeless) pymc3/theano newbie out there.