python, machine-learning, time-series, statsmodels

Unclear how to get a single vector of predictions from a multivariate ARX in statsmodels


As a disclaimer, I have very limited experience with using time-series models.

I am attempting to train an ARX model on a year's worth of hourly energy data for a set of 23 buildings. I expect to get a single vector of predictions for a set of timestamps that fall within the training data, which I can then validate against my testing data covering a subset of the same timestamps. I am starting with statsmodels' VAR(p) model as an initial attempt before adding in my exogenous terms; I presume a VAR model is appropriate, since the data are multivariate per timestamp. My attempt to use VARMAX with order (3, 0) to create a VARX model resulted in a very long-running fit that did not work, so I fell back to a simple VAR model first.

My end goal is to fit a VARX model on the dataset below, with the average of each hour as the exogenous term. I would expect this to result in a single vector of parameters with a length equal to the lag order, which I would then use to predict a single y_hat value for each row in my training data set. I can then compare this output to a subset of the same hours from my testing data set.
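
For concreteness, here is a minimal sketch of the VARX fit I am aiming for (it assumes train_norm is indexed by start_time with the building columns starting at position 3, as in my VAR code further down, and the hourly-average exogenous term is only my intended interpretation):

import pandas as pd
from statsmodels.tsa.statespace.varmax import VARMAX

# Building columns only, indexed by start_time (same slice as in my VAR attempt).
endog = train_norm.iloc[:, 3:]

# Exogenous term: the average value of each hour of day over the training
# period, averaged across buildings and mapped back onto every timestamp.
hour_of_day = endog.index.hour
hourly_profile = endog.groupby(hour_of_day).mean().mean(axis=1)  # 24 values
exog = pd.Series(hour_of_day, index=endog.index).map(hourly_profile).to_frame("hourly_avg")

# order=(3, 0) makes this a VARX(3); this is the fit that ran for a very long
# time and did not finish usefully for me.
varx_mod = VARMAX(endog, exog=exog, order=(3, 0))
varx_res = varx_mod.fit(maxiter=100, disp=False)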

My training dataset of normalized hourly energy data looks like the following (forgive the formatting after Bldg5, it didn't copy well):

start_time          Bldg1               Bldg2               Bldg3               Bldg4                    Bldg5               Bldg7  Bldg8   Bldg9   Bldg10  Bldg11  Bldg12  Bldg13  Bldg14  Bldg15  Bldg16  Bldg17  Bldg18  Bldg19  Bldg20  Bldg21  Bldg22  Bldg23 
2014-01-05 00:00:00 0.2345679012345679  0.08234295415959253 0.02127659574468085 0.006535947712418301    0.3939393939393939  0.020325203252032523    0.034013605442176874    0.11003236245954694 0.013307984790874526    0.013513513513513514    0.06734006734006734 0.02840909090909091 0.3116883116883117  0.5301204819277109  0.03793103448275862 0.058064516129032254    0.3546511627906977  0.009523809523809523    0.47887323943661975 0.9228571428571428  0.04154302670623146 0.2773109243697479
2014-01-05 01:00:00 0.2345679012345679  0.07045840407470289 0.07092198581560284 0.006535947712418301    0.3939393939393939  0.04065040650406505 0.03741496598639456 0.07119741100323625 0.020912547528517112    0.013513513513513514    0.03367003367003367 0.02840909090909091 0.5194805194805195  0.4487951807228916  0.020689655172413793    0.06451612903225806 0.4476744186046512  0.009523809523809523    0.5014084507042254  0.6914285714285714  0.03560830860534124 0.2605042016806723
2014-01-05 02:00:00 0.2345679012345679  0.07555178268251274 0.056737588652482275    0.026143790849673203    0.3636363636363636  0.020325203252032523    0.03741496598639456 0.07119741100323625 0.011406844106463879    0.013513513513513514    0.04377104377104377 0.02840909090909091 0.4675324675324675  0.4728915662650603  0.017241379310344827    0.05161290322580645 0.436046511627907   0.009523809523809523    0.4732394366197183  0.66    0.03857566765578635 0.13165266106442577
2014-01-05 03:00:00 0.2345679012345679  0.07045840407470289 0.02127659574468085 0.006535947712418301    0.25757575757575757 0.036585365853658534    0.03741496598639456 0.07119741100323625 0.020912547528517112    0.010135135135135136    0.037037037037037035    0.02840909090909091 0.4285714285714286  0.39457831325301207 0.020689655172413793    0.08387096774193549 0.19767441860465118 0.006349206349206349    0.47887323943661975 0.7771428571428572  0.04154302670623146 0.16246498599439776
2014-01-05 04:00:00 0.2345679012345679  0.07045840407470289 0.02127659574468085 0.006535947712418301    0.2727272727272727  0.02845528455284553 0.030612244897959183    0.06796116504854369 0.011406844106463879    0.010135135135135136    0.03367003367003367 0.0625  0.3766233766233766  0.009036144578313253    0.020689655172413793    0.1032258064516129  0.0872093023255814  0.009523809523809523    0.49295774647887325 0.19714285714285712 0.03264094955489614 0.12324929971988796

. . . 

Testing dataset (also normalized hourly energy data) looks like the following (shape 168x1):

TestTime            TestBldg
2014-09-07 00:00:00 0.09427609427609428
2014-09-07 01:00:00 0.037037037037037035
2014-09-07 02:00:00 0.0404040404040404
2014-09-07 03:00:00 0.037037037037037035
2014-09-07 04:00:00 0.037037037037037035
. . . 

When I fit a model like the following:

from statsmodels.tsa.api import VAR

# fit a VAR on the building columns (everything from column 3 onward)
var_mod = VAR(train_norm.iloc[:, 3:])
var_res = var_mod.fit()
var_res.summary()

I get a lag coefficient for every building in my dataset, which is unexpected. I also do not understand how to generate predictions from my fitted model, var_res, as I would expect to call var_res.predict() the way you do with the univariate AR model, AutoReg().

(Aside: what is the difference between the forecast and predict functions in statsmodels? And what is the difference between predict on the model before fitting, i.e. var_mod, and on the fitted model, var_res.predict()?)

Please let me know what additional clarifications I can provide.


Solution

  • The general VAR model allows every one of the K series to be affected by any of the K series over the previous p timesteps (using statsmodels' notation), which is why you see a lag coefficient for every building. It sounds like what you want is not necessarily VECTOR autoregression at all (because you don't want one building to depend on the previous timesteps of another building) but rather a plain AR model fit to a "panel" dataset (multiple time series observed in parallel). I'm sure there is a way to do this in statsmodels, but honestly a quick search did not yield one. You could always construct your own version as a standard OLS model: pivot the dataset so that each observation is a building × time row, create new features that are the lagged energy uses from t-1 to t-p, and then run OLS on that (a minimal sketch is below). But there is probably a function in statsmodels that does this which I am missing.
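
    A minimal sketch of that pivot-and-lag idea (the lag order p and column names are illustrative, and it assumes train_norm is indexed by start_time with the building columns starting at position 3, as in your code):

    import pandas as pd
    import statsmodels.api as sm

    p = 3  # lag order, chosen for illustration

    # Reshape to "long"/panel format: one row per building x timestamp.
    wide = train_norm.iloc[:, 3:]
    long_df = wide.stack().rename("energy").reset_index()
    long_df.columns = ["start_time", "bldg", "energy"]

    # Lagged energy use from t-1 to t-p, computed within each building.
    long_df = long_df.sort_values(["bldg", "start_time"])
    for k in range(1, p + 1):
        long_df[f"lag_{k}"] = long_df.groupby("bldg")["energy"].shift(k)
    long_df = long_df.dropna()

    # Pooled OLS: one constant plus p lag coefficients shared by every building.
    X = sm.add_constant(long_df[[f"lag_{k}" for k in range(1, p + 1)]])
    ols_res = sm.OLS(long_df["energy"], X).fit()

    ols_res.params is then a single constant plus p lag coefficients, and ols_res.predict(X) gives the single in-sample y_hat vector you describe.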

    As for your predict/forecast question, predict looks like it is a method of the VAR model object (var_mod in your code) rather than of VARResults (var_res in your code), and it requires you to specify the parameters yourself. So it relies on your in-sample data and only seems to accept prescribed parameters rather than using your fitted ones automatically. If you wanted to, you could do something like

    # params are taken from the fitted results; p is the lag order used in the fit
    var_mod.predict(var_res.params, start=train_norm.index[p], end=train_norm.index[-1], lags=p)
    

    But I think what you're looking for is forecast, since you want to apply it to an arbitrary test building.
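
    A minimal sketch of that, reusing the fitted var_res from the question (the 168-step horizon is only chosen to match the length of the test set):

    # forecast() lives on the fitted results and uses the estimated parameters;
    # it needs the last k_ar observations as its starting values.
    p = var_res.k_ar
    last_obs = train_norm.iloc[:, 3:].values[-p:]

    # One week (168 hours) of out-of-sample forecasts, one column per building.
    fcast = var_res.forecast(y=last_obs, steps=168)

    fcast has shape (steps, K) with one column per training building, which you can then line up against the corresponding hours of your test data.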