
Why is using dot product worsening the performance for PyMC3?


I am trying to run a simple linear regression using PyMC3. The code below is a snippet:

import numpy as np
from pymc3 import Model, sample, Normal, HalfCauchy
import pymc3 as pm

X = np.arange(500).reshape(500, 1)
y = np.random.normal(0, 5, [500, 1]) + X

with Model() as multiple_regression_model:

    beta = Normal('beta', mu=0, sd=1000, shape=2)
    sigma = HalfCauchy('sigma', 1000)

    y_hat = beta[0] + X * beta[1]

    exp = Normal('y', y_hat, sigma=sigma, observed=y)

with multiple_regression_model:
    trace = sample(1000, tune=1000)

trace['beta'].mean(axis=0)

The above code runs in about 6 seconds and gives reasonable estimates for the betas ([-0.19646408, 1.00053091]).

But when I try to use the dot product, things get really bad:

X = np.arange(500).reshape(500, 1)
y = np.random.normal(0, 5, [500, 1]) + X

X_aug_np = np.squeeze(np.dstack((np.ones((500, 1)), X)))

with Model() as multiple_regression_model:

    beta = Normal('beta', mu=0, sd=1000, shape=2)
    sigma = HalfCauchy('sigma', 1000)

    y_hat = pm.math.dot(X_aug_np, beta)

    exp = Normal('y', y_hat, sigma=sigma, observed=y)

with multiple_regression_model:
    trace = sample(1000, tune=1000)

trace['beta'].mean(axis=0)

Now the code takes about 56 seconds and the estimates are totally off ([249.52363555, -0.0000481]).

I thought using the dot product would make things faster. Why is it behaving this way? Am I doing something wrong here?


Solution

  • This is a subtle shape and broadcasting bug: if you change the shape of beta to (2, 1), then it works.

    To see why, I renamed the two models and tidied the code a bit:

    import numpy as np
    import pymc3 as pm
    
    X = np.arange(500).reshape(500, 1)
    y = np.random.normal(0, 5, [500, 1]) + X
    
    X_aug_np = np.squeeze(np.dstack((np.ones((500, 1)), X)))
    
    with pm.Model() as basic_model:
        beta = pm.Normal('beta', mu=0, sd=1000, shape=2)
        sigma = pm.HalfCauchy('sigma', 1000)
    
        y_hat = beta[0] + X * beta[1]
    
        exp = pm.Normal('y', y_hat, sigma=sigma, observed=y)
    
    
    with pm.Model() as matmul_model:
        beta = pm.Normal('beta', mu=0, sd=1000, shape=(2, 1))
        sigma = pm.HalfCauchy('sigma', 1000)
    
        y_hat = pm.math.dot(X_aug_np, beta)
        exp = pm.Normal('y', y_hat, sigma=sigma, observed=y)
    

    How would you have found that out? Since the models looked the same but were not sampling similarly, I ran

    print(matmul_model.check_test_point())
    print(basic_model.check_test_point())
    

    which computes the log probability of the variables at a sensible default point. The two outputs did not match, so I checked exp.tag.test_value.shape and found it was (500, 500), when I expected it to be (500, 1).

    Shape handling is super hard in probabilistic programming, and this happened because exp broadcasts y_hat, sigma, and y together: with beta of shape (2,), pm.math.dot(X_aug_np, beta) has shape (500,), and broadcasting that against the (500, 1) observations y produces a (500, 500) array, so every observation gets paired with every predicted value.
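
    Here is a minimal NumPy-only sketch of that broadcasting (the array contents are placeholders; only the shapes matter):

    import numpy as np

    # Same design matrix construction as in the question: shape (500, 2)
    X_aug = np.squeeze(np.dstack((np.ones((500, 1)), np.arange(500).reshape(500, 1))))
    y = np.zeros((500, 1))                   # same shape as the observed data

    beta_flat = np.zeros(2)                  # shape (2,), as in the question
    beta_col = np.zeros((2, 1))              # shape (2, 1), as in the fix

    print((X_aug @ beta_flat).shape)         # (500,)
    print(((X_aug @ beta_flat) + y).shape)   # (500, 500): every y paired with every y_hat
    print((X_aug @ beta_col).shape)          # (500, 1)
    print(((X_aug @ beta_col) + y).shape)    # (500, 1): elementwise, as intended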

    As an added problem, I could not get matmul_model to sample on my machine without setting cores=1, chains=4.
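
    For reference, a sampling call with those arguments looks like this (a sketch only; run times, estimates, and whether cores=1 is actually needed will vary by machine and PyMC3 version):

    with matmul_model:
        trace = pm.sample(1000, tune=1000, cores=1, chains=4)

    trace['beta'].mean(axis=0)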