Tags: python, machine-learning, scikit-learn, linear-regression

Linear regression for time series


I am pretty new to machine learning and a bit confused, so apologies if this is a trivial question. I have a very simple time series data set with two columns, Date and Price. I'm predicting the price and want to add some features to my model, such as a moving average over the last 10 days. Suppose I split the data set 80:20 into training and validation sets. For the first 80 days I can calculate the moving average. But what about my validation set? Should I use the predicted values as input for the moving average? Is there a ready-made implementation of such a solution? I'm using the Python scikit-learn library.


Solution

  • OK, here is a solution using 250 data points of historical GOOG closing prices. I have explained the code with comments; feel free to ask if anything in there is unclear. As you can see, I use pandas, which has a convenience method, "rolling", that computes rolling means among other things. I split the data set by hand, but it can also be done with e.g. sklearn.model_selection.train_test_split, passing shuffle=False so that the rows stay in chronological order (by default that function shuffles the data, which would mix future rows into the training set).

    import pandas as pd
    from sklearn.linear_model import LinearRegression
    import numpy as np
    
    # Read data from file
    df = pd.read_csv("GOOG.csv")
    
    # Calculate 10 day rolling mean and drop first 10 rows because we cannot calculate rolling mean for them
    # shift moves the averages one step ahead so day 10 gets moving average of days 0-9, etc...
    df["Rolling_10d_close"] = df['Close'].rolling(10).mean().shift(1)
    df = df.dropna()
    
    # Split data into training and validation sets
    training_last_row = int(len(df) * 0.8)
    training_data = df.iloc[:training_last_row]
    validation_data = df.iloc[training_last_row:]
    
    # Train model on training set of data
    x = training_data["Rolling_10d_close"].to_numpy().reshape(-1, 1)
    y = training_data["Close"].to_numpy().reshape(-1, 1)
    
    reg = LinearRegression().fit(x, y)
    print(reg.coef_, reg.intercept_)
    # prints [[0.95972717]] [4.14010503]
    
    # Evaluate the model on the validation set
    # (the rolling mean here is computed from the actual closing prices, not from predictions)
    x_val = validation_data["Rolling_10d_close"].to_numpy().reshape(-1, 1)
    y_val = validation_data["Close"].to_numpy().reshape(-1, 1)
    
    # score() returns the coefficient of determination (R^2)
    print(reg.score(x_val, y_val))
    # prints 0.02467230502090556
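
On the "should I use the predicted value as input for the moving average?" part: the code above builds the validation features from actual closes, which is fine for scoring the model one day ahead because those prices are known. If you want to forecast several days past the last known price, one common approach is to feed each prediction back into the rolling window (often called recursive forecasting). Below is a minimal sketch of that idea, not a ready-made library routine. The function name `recursive_forecast` and its parameters are hypothetical, the toy numbers are made up, and `coef` and `intercept` stand in for the values you would read from `reg.coef_` and `reg.intercept_`.

```python
from collections import deque


def recursive_forecast(history, coef, intercept, steps, window=10):
    """Forecast `steps` values ahead, feeding each prediction back into the window.

    history   : most recent observed prices, oldest first (at least `window` values)
    coef      : slope of the fitted line (hypothetical stand-in for reg.coef_)
    intercept : intercept of the fitted line (hypothetical stand-in for reg.intercept_)
    """
    if len(history) < window:
        raise ValueError("need at least `window` observed values")

    # Keep only the last `window` values; the deque drops the oldest one on append
    recent = deque(history[-window:], maxlen=window)
    forecasts = []
    for _ in range(steps):
        moving_avg = sum(recent) / window           # feature for the next day
        prediction = coef * moving_avg + intercept  # same linear form as the fitted model
        forecasts.append(prediction)
        recent.append(prediction)                   # the prediction replaces the oldest value
    return forecasts


# Toy data, made up for illustration (not GOOG prices)
observed = [10.0, 12.0, 14.0, 16.0]
print(recursive_forecast(observed, coef=1.0, intercept=0.0, steps=2, window=4))
# prints [13.0, 13.75]
```

Keep in mind that errors compound with this approach: every step after the first is built partly on earlier predictions, so the forecasts tend to drift toward a flat line the further ahead you go.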