python tensorflow bayesian tensorflow-probability

Bayesian Linear Regression with Tensorflow Probability

I can't get Bayesian Linear Regression to work with Tensorflow Probability. Here's my code:

!pip install tensorflow==2.0.0-rc1
!pip install tensorflow-probability==0.8.0rc0

import numpy as np
import tensorflow as tf
import tensorflow_probability as tfp
tfd = tfp.distributions
N = 20
std = 1
m = np.random.normal(0, scale=5, size=1).astype(np.float32)
b = np.random.normal(0, scale=5, size=1).astype(np.float32)
x = np.linspace(0, 100, N).astype(np.float32)
y = m*x+b+ np.random.normal(loc=0, scale=std, size=N).astype(np.float32)

num_results = 10000
num_burnin_steps = 5000

def joint_log_prob(x, y, m, b, std):
  rv_m = tfd.Normal(loc=0, scale=5)
  rv_b = tfd.Normal(loc=0, scale=5)
  rv_std = tfd.HalfCauchy(loc=0., scale=2.)

  y_mu = m*x+b
  rv_y = tfd.Normal(loc=y_mu, scale=std)

  return (rv_m.log_prob(m) + rv_b.log_prob(b) + rv_std.log_prob(std)
          + tf.reduce_sum(rv_y.log_prob(y)))

# Define a closure over our joint_log_prob.
def target_log_prob_fn(m, b, std):
    return joint_log_prob(x, y, m, b, std)

@tf.function(autograph=False)
def do_sampling():
  kernel=tfp.mcmc.HamiltonianMonteCarlo(
          target_log_prob_fn=target_log_prob_fn,
          step_size=0.05,
          num_leapfrog_steps=3)
  kernel = tfp.mcmc.SimpleStepSizeAdaptation(
        inner_kernel=kernel, num_adaptation_steps=int(num_burnin_steps * 0.8))
  return tfp.mcmc.sample_chain(
      num_results=num_results,
      num_burnin_steps=num_burnin_steps,
      current_state=[
          0.01 * tf.ones([], name='init_m', dtype=tf.float32),
          0.01 * tf.ones([], name='init_b', dtype=tf.float32),
          1 * tf.ones([], name='init_std', dtype=tf.float32)
      ],
      kernel=kernel,
      trace_fn=lambda _, pkr: [pkr.inner_results.accepted_results.step_size,
                               pkr.inner_results.log_accept_ratio])

samples, [step_size, log_accept_ratio] = do_sampling()
m_posterior, b_posterior, std_posterior = samples

p_accept = tf.reduce_mean(tf.exp(tf.minimum(log_accept_ratio, 0.)))
print('Acceptance rate: {}'.format(p_accept))

n_v = len(samples)
true_values = [m, b, std]

plt.figure()
plt.title('Training data')
plt.plot(x, y)

plt.figure()
plt.title('Visualizing trace and posterior distributions')
for i, (sample, true_value) in enumerate(zip(samples, true_values)):
  plt.subplot(2*n_v, 2, 2*i+1)
  plt.plot(sample)
  plt.subplot(2*n_v, 2, 2*i+2)
  plt.hist(sample)
  plt.axvline(x=true_value)
>>> Acceptance rate: 0.006775229703634977

Any ideas?

Solution

It's surprising how hard such a simple problem can be! Raw HMC can be extremely sensitive to relatively small details in setting up inference. Systems like Stan attempt to handle this by doing a lot of tuning for you, but TFP's autotuning is as of now a lot more basic.

I found a few changes that seem to make inference work well here. Briefly they are:

Reparameterize in unconstrained space.
Use non-uniform step sizes (equivalently: add a diagonal preconditioner)
Increase the number of HMC leapfrog steps.
Use DualAveragingStepSizeAdaptation instead of SimpleStepSizeAdaptation.

The first trick is to use a TransformedTransitionKernel to reparameterize so that the scale parameter lives in an unconstrained space. E.g., we can use the Exp bijector to define a log-scale parameter:

tfb = tfp.bijectors
kernel = tfp.mcmc.TransformedTransitionKernel(
      inner_kernel=kernel,
      bijector=[tfb.Identity(), tfb.Identity(), tfb.Exp()]
  )

This ensures that inference only considers positive values for the scale, so it doesn't have to reject every move that takes the scale below zero. When I do this, the acceptance rates go way up, although the mixing still isn't great.

The second change is to use non-uniform step sizes for the three variables (this is equivalent to diagonal preconditioning). It looks like the posterior in this model is ill-conditioned: twenty datapoints determine the slope much more precisely than they determine the intercept or scale. "Step size adaptation" in TFP simply finds a step size such that the specified percentage of samples are accepted, which is generally governed by the most tightly constrained component of the posterior: if other components have much wider posteriors, the small step size will prevent them from mixing. One way to estimate reasonable step sizes is to use the standard deviations from variational inference with a factored normal surrogate posterior:

surrogate_posterior = tfp.experimental.vi.build_factored_surrogate_posterior(
    event_shape=[[], [], []],
    constraining_bijectors=[None, None, tfb.Exp()])
losses = tfp.vi.fit_surrogate_posterior(
    target_log_prob_fn, surrogate_posterior,
    optimizer=tf.optimizers.Adam(
        learning_rate=0.1,
        # Decay second-moment estimates to aid optimizing scale parameters.
        beta_2=0.9),
    num_steps=1000)
approximate_posterior_stddevs = [np.std(x) for x in surrogate_posterior.sample(50)]

Another general trick is to increase the number of leapfrog steps. One way to think about HMC is that within the leapfrog integrator it's similar to an optimizer with momentum, but it loses its momentum (by resampling) every time it stops to accept/reject. So in the extreme case where we do that every step (num_leapfrog_steps=1, i.e., Langevin dynamics) there is no momentum whatsoever, and increasing the number of leapfrog steps tends to improve the ability to navigate tricky geometry, similarly to how momentum improves optimizers. I didn't tune anything rigorously, but setting num_leapfrog_steps=16 instead of 3 seems to help a lot here.

Here's my modified version of your code, incorporating these tricks. It appears to mix pretty well on most executions (though I'm sure it's not perfect):

!pip install tensorflow==2.0.0-rc1
!pip install tensorflow-probability==0.8.0rc0

import numpy as np
import tensorflow as tf
import tensorflow_probability as tfp

tfd = tfp.distributions
tfb = tfp.bijectors

N = 20
std = 1
m = np.random.normal(0, scale=5, size=1).astype(np.float32)
b = np.random.normal(0, scale=5, size=1).astype(np.float32)
x = np.linspace(0, 100, N).astype(np.float32)
y = m*x+b+ np.random.normal(loc=0, scale=std, size=N).astype(np.float32)

num_results = 1000
num_burnin_steps = 500

def joint_log_prob(x, y, m, b, std):
  rv_m = tfd.Normal(loc=0, scale=5)
  rv_b = tfd.Normal(loc=0, scale=5)
  rv_std = tfd.HalfCauchy(0., scale=1.)

  y_mu = m*x+b
  rv_y = tfd.Normal(loc=y_mu, scale=rv_std[..., None])

  return (rv_m.log_prob(m) + rv_b.log_prob(b)
          + rv_std.log_prob(std)
          + tf.reduce_sum(rv_y.log_prob(y)))

# Define a closure over our joint_log_prob.
def target_log_prob_fn(m, b, std):
    return joint_log_prob(x, y, m, b, std)

# Run variational inference to initialize per-variable step sizes.
surrogate_posterior = tfp.experimental.vi.build_factored_surrogate_posterior(
    event_shape=[[], [], []],
    constraining_bijectors=[None, None, tfb.Exp()])
losses = tfp.vi.fit_surrogate_posterior(
    target_log_prob_fn,
    surrogate_posterior,
    optimizer=tf.optimizers.Adam(
        learning_rate=0.1,
        # Decay second-moment estimates to aid optimizing scale parameters.
        beta_2=0.9),
    num_steps=1000)
approximate_posterior_stddevs = [np.std(z) for z in surrogate_posterior.sample(50)]

@tf.function(autograph=False)
def do_sampling():
  kernel=tfp.mcmc.HamiltonianMonteCarlo(
          target_log_prob_fn=target_log_prob_fn,
          step_size=approximate_posterior_stddevs,
          num_leapfrog_steps=16)
  kernel = tfp.mcmc.TransformedTransitionKernel(
      inner_kernel=kernel,
      bijector=[tfb.Identity(),
                tfb.Identity(),
                tfb.Exp()]
  )
  kernel = tfp.mcmc.DualAveragingStepSizeAdaptation(
        inner_kernel=kernel,
        num_adaptation_steps=int(num_burnin_steps * 0.8))
  return tfp.mcmc.sample_chain(
      num_results=num_results,
      num_burnin_steps=num_burnin_steps,
      current_state=[
          0.01 * tf.ones([], name='init_m', dtype=tf.float32),
          0.01 * tf.ones([], name='init_b', dtype=tf.float32),
          1. * tf.ones([], name='init_std', dtype=tf.float32)
      ],
      kernel=kernel,
      trace_fn=lambda _, pkr: [pkr.inner_results.inner_results.accepted_results.step_size,
                               pkr.inner_results.inner_results.log_accept_ratio])

samples, [step_size, log_accept_ratio] = do_sampling()
m_posterior, b_posterior, std_posterior = samples

p_accept = tf.reduce_mean(tf.exp(tf.minimum(log_accept_ratio, 0.)))
print('Acceptance rate: {}'.format(p_accept))

n_v = len(samples)
true_values = [m, b, std]

plt.figure(figsize=(12, 12))
plt.title('Visualizing trace and posterior distributions')
for i, (sample, true_value) in enumerate(zip(samples, true_values)):
  plt.subplot(2*n_v, 2, 2*i+1)
  plt.plot(sample)
  plt.subplot(2*n_v, 2, 2*i+2)
  plt.hist(sample.numpy())
  plt.axvline(x=true_value)