python machine-learning linear-regression large-data rolling-computation

Linear regression on a dataset with a rolling window

I have a dataset that looks like this:

             IPS10      IPS11     IPS299      IPS12      IPS13      IPS18      IPS25      IPS32      IPS34      IPS38  ...      UTL11      UTL15      UTL17      UTL21      UTL22      UTL29      UTL31      UTL32      UTL33        GDP
      0   3.040102   2.949695   3.319379   3.251798   4.525330   0.379066   2.731048   2.643842   2.453547   1.201144  ...   2.978505  -0.944465   3.585314   6.169364  -0.395442   0.433999  -0.350617   0.899361   1.312837  -1.328266
    ...        ...        ...        ...        ...        ...        ...        ...        ...        ...        ...  ...        ...        ...        ...        ...        ...        ...        ...        ...        ...        ...
    588  -3.126587  -3.576200  -3.512180  -2.411509  -4.629191  -0.391066  -3.902952  -2.169446  -3.584623   0.082130  ...  -2.741805  -2.838139  -3.435455  -3.343945  -0.710171   1.862004  -1.025504  -0.128602  -0.204241  -0.345851

Its shape is:

(593, 144)

Now I'd like to:

Split the data into a training set (in-sample) and a test set (out-of-sample, the last 50 observations).
Use linear regression to predict GDP (the last column), re-estimating the model with a rolling-window forecasting method.

Could you please help me? Thanks.

Solution

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# simulate data
n, m = 593, 144
df = pd.DataFrame(np.random.random((n, m)))
df.rename(columns={m - 1: 'GDP'}, inplace=True)

# split data into train / test and X / y
# assuming the data is ordered chronologically
test_size = 50
train, test = df[:-test_size], df[-test_size:]
X_train, y_train = train.drop(columns='GDP'), train['GDP']
X_test, y_test = test.drop(columns='GDP'), test['GDP']

# linear regression
window_size = 30
reestimation_frequency = 1
# + 1 so that the last window ends at the final training row
for idx in range(0, train.shape[0] - window_size + 1, reestimation_frequency):
    X_window = X_train.iloc[idx:idx + window_size]
    y_window = y_train.iloc[idx:idx + window_size]
    reg = LinearRegression()
    reg.fit(X_window, y_window)
    # do something with reg (e.g. store reg.coef_ or forecast with reg.predict) ...
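
The loop above only refits the model inside the training set. One possible way to extend it to the last 50 observations is a one-step-ahead rolling forecast: for each out-of-sample row, refit on the `window_size` rows immediately before it and predict that single row. The sketch below is only an illustration under assumptions: it uses simulated data as a stand-in for the real dataset, the window length of 30 is carried over from the loop above, and the names `preds` and `rmse` are made up for this example. Note that with 30 rows and 143 predictors the least-squares problem is underdetermined, so in practice a longer window or fewer predictors is probably needed.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# simulated stand-in for the real data (same shape as in the question)
rng = np.random.default_rng(0)
n, m = 593, 144
df = pd.DataFrame(rng.random((n, m)))
df = df.rename(columns={m - 1: 'GDP'})

X, y = df.drop(columns='GDP'), df['GDP']

test_size = 50     # the last 50 observations are out-of-sample
window_size = 30   # assumed window length, same as in the loop above

preds = []
for t in range(n - test_size, n):
    # refit on the window_size observations immediately before row t
    X_window = X.iloc[t - window_size:t]
    y_window = y.iloc[t - window_size:t]
    reg = LinearRegression().fit(X_window, y_window)
    # one-step-ahead forecast for row t (double brackets keep a 2-D frame)
    preds.append(reg.predict(X.iloc[[t]])[0])

# align the forecasts with the out-of-sample targets and score them
y_test = y.iloc[-test_size:]
preds = pd.Series(preds, index=y_test.index, name='GDP_forecast')
rmse = float(np.sqrt(((y_test - preds) ** 2).mean()))
print(f'out-of-sample RMSE: {rmse:.4f}')
```

Each forecast uses only rows that come before the one being predicted, so there is no look-ahead. To re-estimate less often, keep the last fitted `reg` and refit only every k-th step.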