I have a dataframe:
import yfinance as yf
df = yf.download('AAPL',
start='2001-01-01',
end='2005-12-31',
progress=False)
Then I split it into train and test sets with an 80:20 ratio. Here is the code I use to check the index of each set.
train_df.index
The output is
test_df.index
The output is
After fitting the model on the training data, I predict on the 252 test rows, and the result is
How can I turn the prediction output into a DataFrame with a datetime (%Y%m%d) index instead of an integer index like that? I have read many articles and answers on Stack Overflow, but I have not found a solution yet.
One thing you can do is save the datetime index before model training/inference and then join it back on the RangeIndex.
For example:
time_index = df.reset_index()[['utc']] #replace utc with your index name
df = df.reset_index()
Train the model, then join the saved dates back onto the prediction via the shared RangeIndex, and finally set the index to the datetime column:
prediction = prediction.join(time_index)
prediction.set_index('utc', inplace=True)
Working example:
import pandas as pd
import numpy as np
df = pd.DataFrame({'col1':np.arange(10)}, index=pd.date_range('2021-01-01', '2021-01-10'))
df.index.name = 'Date'
#Save the time_index but indexed by RangeIndex to allow for join after prediction
time_index = df.reset_index()[['Date']]
#Some arbitrary prediction dataframe with a RangeIndex
prediction = pd.DataFrame({'predictions':np.arange(0,10)})
#joins prediction and time_index on the RangeIndex
prediction = prediction.join(time_index)
#Sets index to the time_index
prediction.set_index('Date', inplace=True)
You will now have a dataframe looking like this:
            predictions
Date
2021-01-01            0
2021-01-02            1
2021-01-03            2
2021-01-04            3
2021-01-05            4
2021-01-06            5
2021-01-07            6
2021-01-08            7
2021-01-09            8
2021-01-10            9
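If the test set still carries its DatetimeIndex at prediction time, the join can be skipped entirely. The sketch below is only an illustration under a few assumptions: the split is positional (first 80% train, last 20% test), the model returns one prediction per test row in the same row order, and `predicted_values` is a hypothetical stand-in for whatever `model.predict(...)` returns.

```python
import numpy as np
import pandas as pd

# Toy frame with a DatetimeIndex, mirroring the working example above
dates = pd.date_range("2021-01-01", periods=10, name="Date")
df = pd.DataFrame({"col1": np.arange(10)}, index=dates)

# Positional 80:20 split that keeps the DatetimeIndex on both parts
split = int(len(df) * 0.8)
train_df, test_df = df.iloc[:split], df.iloc[split:]

# Hypothetical stand-in for model.predict(test_df): one value per test row
predicted_values = np.arange(len(test_df), dtype=float)

# Reuse the test set's index so the predictions are labelled by date
prediction = pd.DataFrame({"predictions": predicted_values}, index=test_df.index)
print(prediction)
```

This only works when the prediction array lines up row for row with `test_df`; if rows were dropped or reordered in between, the join on a saved index is the safer route.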
Just to drive this home, here is a concrete example using your data source:
import yfinance as yf
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
df = yf.download('AAPL',
start='2001-01-01',
end='2005-12-31',
progress=False)
#Save the time_index but indexed by RangeIndex to allow for join after prediction
time_index = df.reset_index()[['Date']]
df = df.reset_index()
#Assuming we predict Volume
y = df[['Volume']]
X = df.drop(columns=['Volume', 'Date'])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
model = LinearRegression()
model.fit(X_train, y_train)
#Predict values, transpose to fit into dataframe
predicted_values = model.predict(X_test).T[0]
#Create prediction dataframe
prediction = pd.DataFrame({'y-pred':predicted_values}, index=X_test.index)
#join test or true data to prediction for comparison
prediction = prediction.join(y_test)
#joins prediction and time_index on the RangeIndex
prediction = prediction.join(time_index)
#Sets index to the time_index
prediction.set_index('Date', inplace=True)
which results in:
                  y-pred      Volume
Date
2001-07-26  3.893012e+08   369140800
2004-12-20  1.191681e+09  1168126400
2005-02-17  8.905975e+08  1518473600
2002-12-03  2.004725e+08   227869600
2005-10-10  8.430103e+08    50750560
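Note that the dates above are not in chronological order, because `train_test_split` shuffles rows by default (passing `shuffle=False` keeps the original order). As a rough sketch of how the result could be tidied up, the snippet below rebuilds the five rows shown above by hand, sorts them by date, and then formats the index as `%Y%m%d` strings. The variable names `prediction_sorted` and `formatted` are illustrative, and converting to strings is optional since it gives up the DatetimeIndex features.

```python
import pandas as pd

# The five rows from the output above, re-entered by hand for illustration
prediction = pd.DataFrame(
    {
        "y-pred": [3.893012e+08, 1.191681e+09, 8.905975e+08, 2.004725e+08, 8.430103e+08],
        "Volume": [369140800, 1168126400, 1518473600, 227869600, 50750560],
    },
    index=pd.DatetimeIndex(
        ["2001-07-26", "2004-12-20", "2005-02-17", "2002-12-03", "2005-10-10"],
        name="Date",
    ),
)

# Restore chronological order after the shuffled split
prediction_sorted = prediction.sort_index()

# Optional: replace the DatetimeIndex with %Y%m%d strings
formatted = prediction_sorted.copy()
formatted.index = prediction_sorted.index.strftime("%Y%m%d")
formatted.index.name = "Date"
print(formatted)
```

Keeping the DatetimeIndex (that is, stopping at `prediction_sorted`) is usually more convenient for slicing or resampling by date; the string version is mainly useful for display or export.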