As a disclaimer, I have very limited experience with time-series models.
I am attempting to train an ARX model on a year's worth of hourly energy data for a set of 23 buildings. I expect to get a single vector of predictions, given a set of timestamps that fall within the training data, which I can then validate against my testing data covering a subset of the same timestamps. I am starting with a plain VAR(p) model in statsmodels before adding in my exogenous terms; I presume a VAR model is appropriate, since the data are multivariate at each timestamp. My attempt to build a VARX model with VARMAX and order=(3, 0) ran for a very long time without producing a usable result, so I fell back to a simple VAR model first.
My end goal is to fit a VARX model on the dataset below, with the average of each hour as the exogenous term. I would expect this to result in a single vector of parameters with a length equal to the lag order, which I would then use to predict a single vector of y_hat values, one for each row in my training data set. I can then compare this output to a subset of the same hours from my testing data set.
My training dataset of normalized hourly energy data looks like the following (forgive the formatting after Bldg5; it didn't copy well):
start_time Bldg1 Bldg2 Bldg3 Bldg4 Bldg5 Bldg7 Bldg8 Bldg9 Bldg10 Bldg11 Bldg12 Bldg13 Bldg14 Bldg15 Bldg16 Bldg17 Bldg18 Bldg19 Bldg20 Bldg21 Bldg22 Bldg23
2014-01-05 00:00:00 0.2345679012345679 0.08234295415959253 0.02127659574468085 0.006535947712418301 0.3939393939393939 0.020325203252032523 0.034013605442176874 0.11003236245954694 0.013307984790874526 0.013513513513513514 0.06734006734006734 0.02840909090909091 0.3116883116883117 0.5301204819277109 0.03793103448275862 0.058064516129032254 0.3546511627906977 0.009523809523809523 0.47887323943661975 0.9228571428571428 0.04154302670623146 0.2773109243697479
2014-01-05 01:00:00 0.2345679012345679 0.07045840407470289 0.07092198581560284 0.006535947712418301 0.3939393939393939 0.04065040650406505 0.03741496598639456 0.07119741100323625 0.020912547528517112 0.013513513513513514 0.03367003367003367 0.02840909090909091 0.5194805194805195 0.4487951807228916 0.020689655172413793 0.06451612903225806 0.4476744186046512 0.009523809523809523 0.5014084507042254 0.6914285714285714 0.03560830860534124 0.2605042016806723
2014-01-05 02:00:00 0.2345679012345679 0.07555178268251274 0.056737588652482275 0.026143790849673203 0.3636363636363636 0.020325203252032523 0.03741496598639456 0.07119741100323625 0.011406844106463879 0.013513513513513514 0.04377104377104377 0.02840909090909091 0.4675324675324675 0.4728915662650603 0.017241379310344827 0.05161290322580645 0.436046511627907 0.009523809523809523 0.4732394366197183 0.66 0.03857566765578635 0.13165266106442577
2014-01-05 03:00:00 0.2345679012345679 0.07045840407470289 0.02127659574468085 0.006535947712418301 0.25757575757575757 0.036585365853658534 0.03741496598639456 0.07119741100323625 0.020912547528517112 0.010135135135135136 0.037037037037037035 0.02840909090909091 0.4285714285714286 0.39457831325301207 0.020689655172413793 0.08387096774193549 0.19767441860465118 0.006349206349206349 0.47887323943661975 0.7771428571428572 0.04154302670623146 0.16246498599439776
2014-01-05 04:00:00 0.2345679012345679 0.07045840407470289 0.02127659574468085 0.006535947712418301 0.2727272727272727 0.02845528455284553 0.030612244897959183 0.06796116504854369 0.011406844106463879 0.010135135135135136 0.03367003367003367 0.0625 0.3766233766233766 0.009036144578313253 0.020689655172413793 0.1032258064516129 0.0872093023255814 0.009523809523809523 0.49295774647887325 0.19714285714285712 0.03264094955489614 0.12324929971988796
. . .
Testing dataset (also normalized hourly energy data) looks like the following (shape 168x1):
TestTime TestBldg
2014-09-07 00:00:00 0.09427609427609428
2014-09-07 01:00:00 0.037037037037037035
2014-09-07 02:00:00 0.0404040404040404
2014-09-07 03:00:00 0.037037037037037035
2014-09-07 04:00:00 0.037037037037037035
. . .
When I fit a model like the following:
from statsmodels.tsa.api import VAR

# fit an unrestricted VAR on the building columns
var_mod = VAR(train_norm.iloc[:, 3:])
var_res = var_mod.fit()  # no lag order given, so statsmodels picks a default
var_res.summary()
I get a lag coefficient for every building in my dataset, which is unexpected. I also do not understand how to make a prediction from my fitted model, var_res: I would expect to call var_res.predict(), as you do with the univariate AR model, AutoReg(). (Aside: what is the difference between the forecast and predict functions in statsmodels? And what is the difference between predict on the model before fitting, i.e. var_mod.predict(), versus on the fitted model, var_res.predict()?)
Please let me know what additional clarifications I can provide.
The general VAR model allows for an effect of any of the K states within the previous p timesteps (using statsmodels' notation).
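In equation form (this is the standard VAR(p) setup, where $y_t$ is the $K$-vector of all buildings at hour $t$):

$$ y_t = c + A_1 y_{t-1} + A_2 y_{t-2} + \cdots + A_p y_{t-p} + u_t $$

Each $A_i$ is a full $K \times K$ coefficient matrix, so every building's lags enter every other building's equation; that is why you see a lag coefficient for every building in your summary.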
It sounds like what you want is not necessarily VECTOR autoregression at all (because you don't want one building to depend on the previous timesteps of another building) but rather just an AR model fit to a "panel" dataset (multiple time-series observations). I'm sure there is a way to do this in statsmodels, but honestly a quick search did not yield one. You could always construct your own version as a standard OLS model by pivoting the dataset so that each observation is a building-by-time observation, and then creating new features that are the lagged energy uses from t-1 to t-p; a sketch follows below. Then you could simply run OLS on that. But there is probably a function in statsmodels to do this that I am missing.
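For concreteness, here is a minimal sketch of that approach, assuming train_norm has start_time as a column followed by the building columns, and assuming p = 3 (adjust both to your data):

import pandas as pd
import statsmodels.api as sm

p = 3  # assumed lag order

# one row per (building, hour) observation
long = train_norm.melt(id_vars="start_time", var_name="building", value_name="energy")
long = long.sort_values(["building", "start_time"])

# lagged energy use from t-1 to t-p, computed within each building
for k in range(1, p + 1):
    long[f"lag{k}"] = long.groupby("building")["energy"].shift(k)
long = long.dropna()

# pooled AR(p): one shared intercept plus p shared lag coefficients
X = sm.add_constant(long[[f"lag{k}" for k in range(1, p + 1)]])
ols_res = sm.OLS(long["energy"], X).fit()
print(ols_res.params)

This gives you the single parameter vector (p lag coefficients plus an intercept) you described, and since the coefficients are not tied to any particular building, you can apply them to your held-out test building as well.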
As for your predict/forecast question, predict looks like it is a method of the VAR object (var_mod in your code) rather than of VARResults (var_res in your code), and it requires you to pass parameters explicitly. So it relies on your in-sample data and seems only able to accept prescribed parameters rather than use your fitted ones. If you wanted to, you could do something like
var_mod.predict(var_res.params, start=train_norm.index[p], end=train_norm.index[-1], lags=p)
But I think what you're looking for is forecast, since you want to apply it to an arbitrary test building.
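A minimal sketch of that, assuming a 168-step (one-week) horizon to match your test set:

# forecast needs the last k_ar observations as the starting state
y_recent = train_norm.iloc[:, 3:].values[-var_res.k_ar:]
y_hat = var_res.forecast(y=y_recent, steps=168)  # shape (168, n_buildings)

Note that this returns one forecast column per building in the training set; a fitted VAR has building-specific coefficients, so it cannot be applied directly to a new, unseen building, which is another reason the pooled OLS approach above may match your goal better.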