SP500 Prediction using LSTM

I wrote some code to predict the S&P 500 from daily OHLCV data using an LSTM. Here is a sample of the data, which covers 2022-07-08 to 2023-07-08:

Date,Open,High,Low,Close,Volume
2022-07-08,3888.260009765625,3918.5,3869.340087890625,3899.3798828125,3521620000
2022-07-11,3880.93994140625,3880.93994140625,3847.219970703125,3854.429931640625,3423480000
2022-07-12,3851.949951171875,3873.409912109375,3802.360107421875,3818.800048828125,3817210000

And here is the code

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler
from keras.models import Sequential
from keras.layers import LSTM, Dropout, Dense

# Step 1: Load and preprocess the data
data = pd.read_csv('spx_data.csv')
data['Date'] = pd.to_datetime(data['Date'])
data = data.set_index('Date')
close_prices = data['Close'].values.reshape(-1, 1)

# Normalize the data
scaler = MinMaxScaler(feature_range=(0, 1))
scaled_prices = scaler.fit_transform(close_prices)

# Prepare training data
look_back = 250  # Number of previous days to use for prediction
X_train, y_train = [], []
for i in range(look_back, len(scaled_prices)):
    X_train.append(scaled_prices[i-look_back:i, 0])
    y_train.append(scaled_prices[i, 0])
X_train, y_train = np.array(X_train), np.array(y_train)

# Reshape input data for LSTM [samples, time steps, features]
X_train = np.reshape(X_train, (X_train.shape[0], X_train.shape[1], 1))

# Step 2: Build the LSTM model
model = Sequential()
model.add(LSTM(units=50, return_sequences=True, input_shape=(X_train.shape[1], 1)))
model.add(Dropout(0.2))
model.add(LSTM(units=50, return_sequences=True))
model.add(Dropout(0.2))
model.add(LSTM(units=50))
model.add(Dropout(0.2))
model.add(Dense(units=1))

# Compile and train the model
model.compile(optimizer='adam', loss='mean_squared_error')
model.fit(X_train, y_train, epochs=50, batch_size=32)

# Step 3: Make predictions for the future 30 days
look_back_days = close_prices[-look_back:].reshape(1, -1)
scaled_predictions = []
for _ in range(30):
    scaled_prediction = model.predict(look_back_days)
    scaled_predictions.append(scaled_prediction[0, 0])
    look_back_days = np.append(look_back_days[:, 1:], scaled_prediction, axis=1)

# Inverse transform the predictions to get actual values
predictions = scaler.inverse_transform(np.array(scaled_predictions).reshape(-1, 1))

# Step 4: Plot historical data and predicted data
plt.figure(figsize=(12, 6))
plt.plot(data.index, close_prices, color='blue', label='Historical Data')
plt.plot(pd.date_range(start=data.index[-1], periods=30, freq='D'), predictions, color='red', linestyle='dashed',
         label='Predicted Data')
plt.xlabel('Date')
plt.ylabel('SPX Close Price')
plt.title('SPX Close Price Prediction')
plt.legend()
plt.grid(True)
plt.show()

Here is the prediction result

Obviously, this is not correct: the S&P 500 cannot plausibly jump to 4800+ by next Monday. What did I do wrong, and how can I fix it?

Solution

On this line:

look_back_days = close_prices[-look_back:].reshape(1, -1)

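Remember that if you apply feature scaling during training, you also have to apply the same scaling during inference. Here the model was trained on values in the range [0, 1], but this line feeds it raw closing prices in the thousands, so its inputs are roughly 1000x as big as the model is expecting.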
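To make the size of the mismatch concrete, here is a minimal sketch that uses only the three closing prices quoted in the question. It reproduces in plain NumPy the (x - min) / (max - min) arithmetic that MinMaxScaler applies with feature_range=(0, 1); the names lo, hi, and scaled are illustrative and not part of the original code.

```python
import numpy as np

# The three closing prices shown in the question, rounded to two decimals.
close_prices = np.array([3899.38, 3854.43, 3818.80]).reshape(-1, 1)

# Hand-rolled equivalent of MinMaxScaler(feature_range=(0, 1)):
# x_scaled = (x - min) / (max - min), computed per column.
lo = close_prices.min(axis=0)
hi = close_prices.max(axis=0)
scaled = (close_prices - lo) / (hi - lo)

# What the model sees during training: values inside [0, 1].
print("scaled range:", scaled.min(), "to", scaled.max())

# What the unscaled window feeds it at inference: values in the thousands.
print("raw range:   ", close_prices.min(), "to", close_prices.max())
```

A network whose weights were fitted on the first range has no reason to behave sensibly on the second, which is consistent with the implausible jump in the plot.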

Therefore, scale your inputs with the already-fitted scaler (transform, not fit_transform) before passing them to predict().

look_back_days = scaler.transform(close_prices[-look_back:]).reshape(1, -1)

Here's what the output looks like after making this change.
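As a rough sketch of how the 30-day loop can stay in scaled units from start to finish, the helper below feeds each prediction back into the window. The names rolling_forecast, predict_fn, and mean_model are hypothetical; mean_model just averages the window and stands in for the trained LSTM so the example runs without Keras. The window is kept in the same (samples, time steps, features) layout that the question uses for X_train, which is an assumption about what the model expects at inference.

```python
import numpy as np


def rolling_forecast(predict_fn, scaled_window, steps):
    """Feed each prediction back into the window, staying in scaled units.

    predict_fn    : hypothetical stand-in for model.predict; takes an array
                    of shape (1, look_back, 1), returns an array of shape (1, 1).
    scaled_window : 1-D array of the last look_back values, already scaled.
    steps         : number of future steps to generate.
    """
    window = np.asarray(scaled_window, dtype=float).reshape(1, -1, 1)
    outputs = []
    for _ in range(steps):
        next_scaled = np.asarray(predict_fn(window)).reshape(1, 1, 1)
        outputs.append(float(next_scaled[0, 0, 0]))
        # Drop the oldest value and append the newest prediction.
        window = np.concatenate([window[:, 1:, :], next_scaled], axis=1)
    return np.array(outputs)


# Toy stand-in for a trained model: predicts the mean of the window.
def mean_model(window):
    return window.mean(axis=1)  # shape (1, 1)


scaled_window = np.linspace(0.0, 1.0, 5)  # pretend these are scaled prices
forecast = rolling_forecast(mean_model, scaled_window, steps=3)
print(forecast)
```

With the question's code, one would pass model.predict as predict_fn and the scaled last look_back closing prices as scaled_window, then call scaler.inverse_transform on the result, as the original script already does.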