python dataframe datetime indexing prediction

Index and Date Problem on Plot Prediction

I have a dataframe:

import yfinance as yf
df = yf.download('AAPL',
                 start='2001-01-01',
                 end='2005-12-31',
                 progress=False)

Then I split it into train and test sets with an 80:20 ratio. Here is the code I use to check the index of each set.

train_df.index

The output is

test_df.index

The output is

After fitting the model on the training data, I predict on the 252 test rows, and the result is

How can I turn the prediction output into a dataframe with a datetime (%Y%m%d) index instead of an integer index like that? I have read many articles and answers on Stack Overflow, but I have not found a solution yet.

Solution

One thing you can do is save the datetime index before model training/inference and then join it back onto the predictions on the RangeIndex.

For example:

time_index = df.reset_index()[['utc']] #replace utc with your index name
df = df.reset_index()

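Train the model, then join the saved dates back onto the prediction dataframe on the RangeIndex and set the index to the datetime column. This assumes prediction is a dataframe whose index still holds the original integer row labels.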

prediction = prediction.join(time_index)
prediction.set_index('utc', inplace=True)

Working example:

import pandas as pd
import numpy as np

df = pd.DataFrame({'col1':np.arange(10)}, index=pd.date_range('2021-01-01', '2021-01-10'))
df.index.name = 'Date'
#Save the time_index but indexed by RangeIndex to allow for join after prediction
time_index = df.reset_index()[['Date']]

#Some arbitrary prediction dataframe with a RangeIndex
prediction = pd.DataFrame({'predictions':np.arange(0,10)})

#joins prediction and time_index on the RangeIndex
prediction = prediction.join(time_index)

#Sets index to the time_index
prediction.set_index('Date', inplace=True)

You will now have a dataframe looking like this:

            predictions
Date
2021-01-01            0
2021-01-02            1
2021-01-03            2
2021-01-04            3
2021-01-05            4
2021-01-06            5
2021-01-07            6
2021-01-08            7
2021-01-09            8
2021-01-10            9
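
The join pairs rows by their integer label, not by their order, so it also works when the prediction covers only some of the rows. Here is a minimal sketch of that case, assuming a hypothetical test set made of rows 7, 2 and 9 (the labels and prediction values are made up for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': np.arange(10)},
                  index=pd.date_range('2021-01-01', '2021-01-10'))
df.index.name = 'Date'

# Dates keyed by the integer row label (RangeIndex 0..9)
time_index = df.reset_index()[['Date']]

# Hypothetical test rows in shuffled order, as a random split would produce
test_labels = [7, 2, 9]
prediction = pd.DataFrame({'predictions': [70, 20, 90]}, index=test_labels)

# Left join on the integer labels: each prediction picks up its own date
prediction = prediction.join(time_index).set_index('Date')
print(prediction)
```

Each prediction ends up next to the date of the row it came from, which is why the prediction dataframe should be built with the integer labels of the test rows rather than a fresh 0..n-1 index.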

Just to drive this home, here is a concrete example using your data source:

import yfinance as yf
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

df = yf.download('AAPL',
                 start='2001-01-01',
                 end='2005-12-31',
                 progress=False)

#Save the time_index but indexed by RangeIndex to allow for join after prediction
time_index = df.reset_index()[['Date']]
df = df.reset_index()

#Assuming we predict Volume
y = df[['Volume']]
X = df.drop(columns=['Volume', 'Date'])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

model = LinearRegression()
model.fit(X_train, y_train)

#Predict values, transpose to fit into dataframe
predicted_values = model.predict(X_test).T[0]

#Create prediction dataframe
prediction = pd.DataFrame({'y-pred':predicted_values}, index=X_test.index)

#join test or true data to prediction for comparison
prediction = prediction.join(y_test)

#joins prediction and time_index on the RangeIndex
prediction = prediction.join(time_index)

#Sets index to the time_index
prediction.set_index('Date', inplace=True)

which results in the following (the rows are not in date order because train_test_split shuffles by default):

                  y-pred      Volume
Date
2001-07-26  3.893012e+08   369140800
2004-12-20  1.191681e+09  1168126400
2005-02-17  8.905975e+08  1518473600
2002-12-03  2.004725e+08   227869600
2005-10-10  8.430103e+08    50750560
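
If you split chronologically and never reset the index, the join may not be needed at all, because the test set still carries its dates. The sketch below is only an illustration under assumptions: it uses synthetic data instead of yfinance, hypothetical Open/Close columns, and a plain least-squares line from np.polyfit as a stand-in for whatever model you trained.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the downloaded prices (assumption: business-day index)
dates = pd.date_range('2001-01-01', periods=100, freq='B', name='Date')
rng = np.random.default_rng(0)
df = pd.DataFrame({'Open': rng.normal(10, 1, 100)}, index=dates)
df['Close'] = df['Open'] * 1.01 + rng.normal(0, 0.1, 100)

# Chronological 80:20 split; both parts keep their DatetimeIndex
split = int(len(df) * 0.8)
train_df, test_df = df.iloc[:split], df.iloc[split:]

# Stand-in model: straight-line least-squares fit of Close on Open
slope, intercept = np.polyfit(train_df['Open'].to_numpy(),
                              train_df['Close'].to_numpy(), deg=1)
predicted_values = slope * test_df['Open'].to_numpy() + intercept

# The model output is a plain array, so attach the test dates directly
prediction = pd.DataFrame({'y-pred': predicted_values,
                           'Close': test_df['Close'].to_numpy()},
                          index=test_df.index)

# Optional: the same dates as %Y%m%d strings
formatted = prediction.index.strftime('%Y%m%d')
print(prediction.head())
```

The key line is index=test_df.index: any array of predictions with one value per test row can be wrapped this way, whichever model produced it.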