Search code examples
linear-regressionpymc3

Why is linear regression so poor in training using PYMC3


I am new to PYMC3. Maybe this is a naive question, but I searched a lot and didn't find any explanation on this issue. Basically, I want to do a linear regression in PYMC3, however the training is very slow and the model performance on training set is very poor as well. Below is my code:

X_Tr = np.array([ 13.99802212,  13.8512075 ,  13.9531636 ,  13.97432944,
    13.89211468,  13.91357953,  13.95987483,  13.86476587,
    13.9501789 ,  13.92698143,  13.9653932 ,  14.06663115,
    13.91697969,  13.99629862,  14.01392784,  13.96495713,
    13.98697998,  13.97516973,  14.01048397,  14.05918188,
    14.08342002,  13.89350606,  13.81768849,  13.94942447,
    13.90465027,  13.93969029,  14.18771189,  14.08631113,
    14.03718829,  14.01836206,  14.06758363,  14.05243539,
    13.96287123,  13.93011351,  14.01616973,  14.01923812,
    13.97424024,  13.9587175 ,  13.85669845,  13.97778302,
    14.04192138,  13.93775494,  13.86693585,  13.79985956,
    13.82679677,  14.06474544,  13.90821822,  13.71648423,
    13.78899668,  13.76857337,  13.87201756,  13.86152949,
    13.80447525,  13.99609891,  14.0210165 ,  13.986906  ,
    13.97479211,  14.04562055,  14.03293095,  14.15178043,
    14.32413197,  14.2330354 ,  13.99247751,  13.92962912,
    13.95394525,  13.87888254,  13.82743111,  14.10724699,
    14.23638905,  14.15731881,  14.13239278,  14.13386722,
    13.91442452,  14.01056255,  14.19378649,  14.22233852,
    14.30405399,  14.25880108,  14.23985258,  14.21184303,
    14.4443183 ,  14.55710331,  14.42102092,  14.29047616,
    14.43712609,  14.58666212])
y_tr = np.array([ 13.704,  13.763,  13.654,  13.677,  13.66 ,  13.735,  13.845,
    13.747,  13.747,  13.606,  13.819,  13.867,  13.817,  13.68 ,
    13.823,  13.779,  13.814,  13.936,  13.956,  13.912,  13.982,
    13.979,  13.919,  13.944,  14.094,  13.983,  13.887,  13.902,
    13.899,  13.881,  13.784,  13.909,  13.99 ,  14.06 ,  13.834,
    13.778,  13.703,  13.965,  14.02 ,  13.992,  13.927,  14.009,
    13.988,  14.022,  13.754,  13.837,  13.91 ,  13.907,  13.867,
    14.014,  13.952,  13.796,  13.92 ,  14.051,  13.773,  13.837,
    13.745,  14.034,  13.923,  14.041,  14.077,  14.125,  13.989,
    14.174,  13.967,  13.952,  14.024,  14.171,  14.175,  14.091,
    14.267,  14.22 ,  14.071,  14.112,  14.174,  14.289,  14.146,
    14.356,  14.5  ,  14.265,  14.259,  14.406,  14.463,  14.473,
    14.413,  14.507])
sns.regplot(x=X_tr, y=y_tr.flatten());

enter image description here

Here I use PYMC3 to train the model:

shA_X = shared(X_tr)
with pm.Model() as linear_model:    
    alpha = pm.Normal("alpha", mu=14,sd=100)
    betas = pm.Normal("betas", mu=0, sd=100, shape=1)
    sigma = pm.HalfCauchy('sigma', beta=10, testval=1.)
    mu = alpha + betas * shA_X
    forecast = pm.Normal("forecast", mu=mu, sd=sigma, observed=y_tr)
    step = pm.NUTS()
    trace=pm.sample(3000, tune=1000)

and then check the performance:

ppc_w = pm.sample_ppc(trace, 1000, linear_model,
                    progressbar=False)
plt.plot(ppc_w['forecast'].mean(axis=0),'r')
plt.plot(y_tr, color='k')`

enter image description here

Why is the prediction on the training set so poor? Any suggestions and ideas are appreciated.


Solution

  • This model is doing a fine job - I think the confusion is over how to handle the PyMC3 objects (thank you for the easy to work with example, though!). In general, PyMC3 will be used to quantify the uncertainty in your model.

    For example, trace['betas'].mean() is around 0.83 (this will depend on your random seed), while the least squares estimate that, for example, sklearn gives will be 0.826. Similarly, trace['alpha'].mean() gives 2.34, while the "true" value is 2.38.

    You can also use the trace to plot many different plausible draws for your line of best fit:

    for draw in trace[::100]:
        pred = draw['betas'] * X_tr + draw['alpha']
        plt.plot(X_tr, pred, '--', alpha=0.2, color='grey')
    
    
    plt.plot(X_tr, y_tr, 'o');
    

    enter image description here

    Note that these are drawn from the distribution of "best fits" for your data. You also used sigma to model the noise, and you can plot this value as well:

    for draw in trace[::100]:
        pred = draw['betas'] * X_tr + draw['alpha']
    
        plt.plot(X_tr, pred, '-', alpha=0.2, color='grey')
        plt.plot(X_tr, pred + draw['sigma'], '-', alpha=0.05, color='red')
        plt.plot(X_tr, pred - draw['sigma'], '-', alpha=0.05, color='red');
    
    
    plt.plot(X_tr, y_tr, 'o');
    

    enter image description here

    Using sample_ppc draws observed values from your posterior distribution, so each row of ppc_w['forecast'] is a reasonable way for the data to be generated "next time". You might use that object this way:

    ppc_w = pm.sample_ppc(trace, 1000, linear_model,
                          progressbar=False)
    for draw in ppc_w['forecast'][::5]:
        sns.regplot(X_tr, draw, scatter_kws={'alpha': 0.005, 'color': 'r'}, fit_reg=False)
    sns.regplot(X_tr, y_tr, color='k', fit_reg=False);
    

    enter image description here