python, machine-learning, linear-regression, large-data, rolling-computation

Linear regression on a dataset with a rolling window


I have a dataset that looks like this:

             IPS10      IPS11     IPS299      IPS12      IPS13      IPS18      IPS25      IPS32      IPS34      IPS38  ...      UTL11      UTL15      UTL17      UTL21      UTL22      UTL29      UTL31      UTL32      UTL33        GDP
    0     3.040102   2.949695   3.319379   3.251798   4.525330   0.379066   2.731048   2.643842   2.453547   1.201144  ...   2.978505  -0.944465   3.585314   6.169364  -0.395442   0.433999  -0.350617   0.899361   1.312837  -1.328266
    ...        ...        ...        ...        ...        ...        ...        ...        ...        ...        ...  ...        ...        ...        ...        ...        ...        ...        ...        ...        ...        ...
    588  -3.126587  -3.576200  -3.512180  -2.411509  -4.629191  -0.391066  -3.902952  -2.169446  -3.584623   0.082130  ...  -2.741805  -2.838139  -3.435455  -3.343945  -0.710171   1.862004  -1.025504  -0.128602  -0.204241  -0.345851

Its shape is:

(593, 144)

Now I'd like to:

  1. Split the data into a training set (in-sample) and a test set (out-of-sample, the last 50 observations).
  2. Use linear regression to predict GDP (the last column) and re-estimate the regression using a rolling-window forecasting method.

Could you please help me? Thanks.


Solution

    import numpy as np
    import pandas as pd
    from sklearn.linear_model import LinearRegression
    
    # simulate data
    n, m = 593, 144
    df = pd.DataFrame(np.random.random((n, m)))
    df.rename(columns={m - 1: 'GDP'}, inplace=True)
    
    # split data into train / test and X / y
    # assuming data ordered chronologically
    test_size = 50
    train, test = df[:-test_size], df[-test_size:]
    X_train, y_train = train.drop(columns='GDP'), train['GDP']
    X_test, y_test = test.drop(columns='GDP'), test['GDP']
    
    # rolling-window linear regression on the training set
    window_size = 30
    reestimation_frequency = 1
    for idx in range(0, train.shape[0] - window_size, reestimation_frequency):
        X_window = X_train.iloc[idx:idx + window_size]
        y_window = y_train.iloc[idx:idx + window_size]
        reg = LinearRegression()
        reg.fit(X_window, y_window)
        # do something with reg, e.g. predict the observation at idx + window_size
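
The loop above only re-fits the model inside the training set. Below is a minimal sketch of one way to extend the same idea to the out-of-sample part of the question: for each of the last 50 observations, re-fit on the rows immediately before it and forecast that single row. The helper name `rolling_forecast` and the default `window_size=200` are my own assumptions, not something from the question. I picked 200 only so that each window has more rows than the 143 predictors; with a window of 30 the least-squares problem is underdetermined.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression


def rolling_forecast(df, target="GDP", test_size=50, window_size=200):
    """One-step-ahead forecasts for the last `test_size` rows of `df`.

    For each test row t, a fresh LinearRegression is fit on the
    `window_size` rows immediately before t and used to predict row t.
    (Hypothetical helper; assumes rows are in chronological order.)
    """
    n = len(df)
    if n - test_size - window_size < 0:
        raise ValueError("window_size + test_size exceeds the number of rows")

    # work on plain arrays to avoid issues with mixed column-name types
    X = df.drop(columns=target).to_numpy()
    y = df[target].to_numpy()

    preds = []
    for t in range(n - test_size, n):
        reg = LinearRegression()
        reg.fit(X[t - window_size:t], y[t - window_size:t])
        preds.append(reg.predict(X[t:t + 1])[0])

    return pd.Series(preds, index=df.index[-test_size:], name="forecast")


# simulated data with the same shape as in the question
rng = np.random.default_rng(0)
n, m = 593, 144
df = pd.DataFrame(rng.random((n, m)))
df = df.rename(columns={m - 1: "GDP"})

forecast = rolling_forecast(df, target="GDP", test_size=50, window_size=200)
actual = df["GDP"].iloc[-50:]
rmse = float(np.sqrt(((actual - forecast) ** 2).mean()))
print(f"out-of-sample RMSE: {rmse:.4f}")
```

Note that each forecast uses the actual values of the preceding test rows, which is the usual one-step-ahead setup. If you want the window to grow instead of roll, replace `t - window_size` with `0` in both slices.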