
Index and Date Problem on Plot Prediction


I have a dataframe:

import yfinance as yf
df = yf.download('AAPL',
                 start='2001-01-01',
                 end='2005-12-31',
                 progress=False)

Then I split it into train and test sets with an 80:20 ratio. Here is the code I use to check the index of each set:

train_df.index

The output is

[image: output of train_df.index]

test_df.index

The output is

[image: output of test_df.index]

After fitting the model on the training data, I run a prediction on the 252 test rows, and the result is

[image: prediction output with an integer index]

How can I turn the prediction output into a dataframe with a datetime (%Y%m%d) index instead of an integer index like that? I have read many articles and answers on Stack Overflow, but I have not found a solution yet.


Solution

  • One thing you can do is to simply save the datetime index before model training/inference and then join it back on the RangeIndex.

    For example:

    time_index = df.reset_index()[['utc']]  # replace 'utc' with your index name
    df = df.reset_index()
    

    Train the model, then join the saved dates onto the prediction via the shared RangeIndex and set the date column back as the index:

    prediction = prediction.join(time_index)
    prediction.set_index('utc', inplace=True)
    

    Working example:

    import pandas as pd
    import numpy as np
    
    df = pd.DataFrame({'col1':np.arange(10)}, index=pd.date_range('2021-01-01', '2021-01-10'))
    df.index.name = 'Date'
    #Save the time_index but indexed by RangeIndex to allow for join after prediction
    time_index = df.reset_index()[['Date']]
    
    #Some arbitrary prediction dataframe with a RangeIndex
    prediction = pd.DataFrame({'predictions':np.arange(0,10)})
    
    #joins prediction and time_index on the RangeIndex
    prediction = prediction.join(time_index)
    
    #Sets index to the time_index
    prediction.set_index('Date', inplace=True)
    

    You will now have a dataframe looking like this:

                predictions
    Date
    2021-01-01            0
    2021-01-02            1
    2021-01-03            2
    2021-01-04            3
    2021-01-05            4
    2021-01-06            5
    2021-01-07            6
    2021-01-08            7
    2021-01-09            8
    2021-01-10            9
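
If the prediction has exactly one value per test row, in the same order, a shorter route may be to reuse the test set's index directly instead of joining. The following is only a sketch under that assumption: `test_df` and `predicted_values` are hypothetical stand-ins for your test set and for the array returned by your model, not your actual data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the test set: 5 rows with a DatetimeIndex named 'Date'
test_df = pd.DataFrame(
    {'Close': np.linspace(10.0, 14.0, 5)},
    index=pd.date_range('2005-01-03', periods=5, freq='B', name='Date'),
)

# Stand-in for the model output: a plain NumPy array, one value per test row
predicted_values = test_df['Close'].to_numpy() * 1.01

# Reuse the test set's index directly; this requires matching length and order
prediction = pd.DataFrame({'predictions': predicted_values}, index=test_df.index)

# Optional: render the dates as %Y%m%d strings (this yields strings, not datetimes)
formatted = prediction.index.strftime('%Y%m%d')
```

Keeping the DatetimeIndex itself and only formatting it for display is usually the more flexible choice, since date-based slicing and plotting keep working.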
    

    Just to drive this home, here is a concrete example using your data source:

    import yfinance as yf
    import pandas as pd
    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import train_test_split
    
    df = yf.download('AAPL',
                     start='2001-01-01',
                     end='2005-12-31',
                     progress=False)
    
    #Save the time_index but indexed by RangeIndex to allow for join after prediction
    time_index = df.reset_index()[['Date']]
    df = df.reset_index()
    
    #Assuming we predict Volume
    y = df[['Volume']]
    X = df.drop(columns=['Volume', 'Date'])
    
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
    
    model = LinearRegression()
    model.fit(X_train, y_train)
    
    #Predict values; the result has shape (n, 1) here, so .T[0] flattens it to 1-D for the dataframe
    predicted_values = model.predict(X_test).T[0]
    
    #Create prediction dataframe
    prediction = pd.DataFrame({'y-pred':predicted_values}, index=X_test.index)
    
    #join test or true data to prediction for comparison
    prediction = prediction.join(y_test)
    
    #joins prediction and time_index on the RangeIndex
    prediction = prediction.join(time_index)
    
    #Sets index to the time_index
    prediction.set_index('Date', inplace=True)
    

    which results in (the dates are out of order because train_test_split shuffles the rows by default):

    
                      y-pred      Volume
    Date
    2001-07-26  3.893012e+08   369140800
    2004-12-20  1.191681e+09  1168126400
    2005-02-17  8.905975e+08  1518473600
    2002-12-03  2.004725e+08   227869600
    2005-10-10  8.430103e+08   50750560
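
For time series you may prefer a chronological split, so that the test set is the last 20% of the dates and the predictions come out in date order. With scikit-learn that would be `train_test_split(..., shuffle=False)`. The sketch below shows the same idea with plain pandas slicing. It uses made-up data and a NumPy line fit as a placeholder model, so every name and value in it is an assumption for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical data standing in for the downloaded prices
dates = pd.date_range('2021-01-01', periods=10, freq='D', name='Date')
df = pd.DataFrame(
    {'x': np.arange(10.0), 'y': 2.0 * np.arange(10.0) + 1.0},
    index=dates,
)

# Chronological 80:20 split: first 80% of rows for training, last 20% for testing
split = int(len(df) * 0.8)
train_df, test_df = df.iloc[:split], df.iloc[split:]

# Placeholder "model": a straight-line fit with NumPy instead of a real estimator
slope, intercept = np.polyfit(train_df['x'], train_df['y'], deg=1)
predicted_values = slope * test_df['x'].to_numpy() + intercept

# The test set never lost its DatetimeIndex, so it can be reused as-is
prediction = pd.DataFrame(
    {'y-pred': predicted_values, 'y': test_df['y']},
    index=test_df.index,
)

# If the predictions came from a shuffled split instead, this restores date order
prediction = prediction.sort_index()
```

Because the index is never reset in this variant, no join is needed afterwards.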