python pandas matplotlib statsmodels arima

NaN in output when trying to use an ARIMA model


Plot the exchange rate of the Russian ruble against the Egyptian pound using end-of-trading data. Select the best ARIMA model, use it to predict further exchange rate values, display the prediction on the graph, and draw conclusions.

This is the task I need to do, but when I print the predicted values, I get NaN. I don't know how to fix it. Here is the code:

import matplotlib.pyplot as plt
import pandas as pd
import csv
import statsmodels.api as sm
from statsmodels.tsa.stattools import adfuller

def tundra():
    print('1 - Output')
    print('2 - Downloading')
    oi = input('Choose option: ')
    if oi == '1':
        plt.show()
    elif oi == '2':
        po = input('Enter a name for the saved file: ')
        plt.savefig(f'{po}.jpg')
        print(f'Graph saved as {po}.jpg')
    else:
        print("Let's do it again")
        tundra()

dates = []
rub_to_nok = []

with open('D:\\Teleegram dw\\mfdexport_1month_01012014_30082024.csv', 'r') as csvfile:
    plots = csv.reader(csvfile, delimiter=';')
    next(plots)
    for row in plots:
        if row[0] == 'NOK':
            dates.append(pd.to_datetime(int(row[2]), format='%Y%m%d'))
            rub_to_nok.append(float(row[6]))

data = pd.DataFrame({'Date': dates, 'RUB_to_NOK': rub_to_nok})
data.set_index('Date', inplace=True)

print("Checking for NaN in data:")
print(data.isnull().sum())
data.dropna(inplace=True)

print(data.info())
print(data.head())

result = adfuller(data['RUB_to_NOK'])
print('Dickey-Fuller test:')
print('Statistics: %f' % result[0])
print('p-value: %f' % result[1])
print('Critical values:')
for key, value in result[4].items():
    print(f'    {key}: {value}')

model = sm.tsa.ARIMA(data['RUB_to_NOK'], order=(3, 1, 0))
model_fit = model.fit()

print(model_fit.summary())

forecast = model_fit.forecast(steps=30)
forecast_index = pd.date_range(start=data.index[-1] + pd.Timedelta(days=1), periods=30, freq='D')

forecast_df = pd.DataFrame(forecast, index=forecast_index, columns=['Forecast'])

print("Predicted values:")
for index, row in forecast_df.iterrows():
    print(f"Date: {index.strftime('%Y-%m-%d')}, Prognosis: {row['Forecast']:.4f}")

plt.figure(figsize=(25, 12), facecolor='grey')
plt.gca().set_facecolor('black')
plt.plot(data.index, data['RUB_to_NOK'], label='Cost RUB/NOK', color='white', alpha=0.5, linewidth=2)
plt.plot(forecast_df.index, forecast_df['Forecast'], label='Prediction (AR(3))', color='blue', linewidth=2, linestyle='--')
plt.title('RUB to NOK value with prediction (AR(3))')
plt.xlabel('Date')
plt.ylabel('Cost in CZK')
plt.legend()
plt.grid()

tundra()

Here is the csv file:

https://drive.google.com/file/d/1W-81mTJUqLHB73_2zLV5bPG-3d3oaO4T/view?usp=sharing

I checked the data and it looks normal. I don't know how to get a sensible result.


Solution

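  • The issue is the index you pass when building the DataFrame. The model's forecast() already returns a pd.Series that carries its own dates, so you don't need to create a custom index. When a Series is passed to pd.DataFrame together with a different index, pandas aligns on the labels instead of assigning by position, and every label without a match becomes NaN. Just convert the Series to a frame and rename the column: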

    forecast = model_fit.forecast(steps=30)
    forecast_df = forecast.to_frame()
    forecast_df.columns = ['Forecast']
    
    print("Predicted values:")
    for index, row in forecast_df.iterrows():
        print(f"Date: {index.strftime('%Y-%m-%d')}, Prognosis: {row['Forecast']:.4f}")
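
To see the alignment behaviour in isolation, here is a minimal sketch that uses only pandas. The `forecast` Series below is a hand-made stand-in for what `model_fit.forecast()` returns. Its values, its monthly dates, and the name `predicted_mean` are assumptions for illustration, not output from the real model or the linked CSV. The last variant shows one way to keep a custom index if you really want one: pass the raw values so there are no labels to align on.

```python
import pandas as pd

# Hypothetical stand-in for model_fit.forecast(steps=3): a named Series
# with its own (monthly) date index. Values and dates are made up.
forecast = pd.Series(
    [1.0, 2.0, 3.0],
    index=pd.date_range("2024-09-01", periods=3, freq="MS"),
    name="predicted_mean",
)

# Custom daily index, built the same way as in the question.
daily_index = pd.date_range("2024-08-02", periods=3, freq="D")

# Pattern from the question: pandas aligns on labels (index and column name),
# finds no matches, and fills the frame with NaN.
broken = pd.DataFrame(forecast, index=daily_index, columns=["Forecast"])

# Pattern from the answer: keep the index the Series already has.
fixed = forecast.to_frame(name="Forecast")

# Alternative: pass plain values, so the custom index is assigned by position.
custom = pd.DataFrame({"Forecast": forecast.to_numpy()}, index=daily_index)

print(broken)
print(fixed)
print(custom)
```

Whether the custom daily index is appropriate depends on the spacing of your data. If the observations are monthly, labelling the forecast steps as consecutive days would be misleading, so keeping the index returned by the model is probably the safer choice.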