python machine-learning scikit-learn forecasting sklearn-pandas

How do I forecast data (in my case, rainfall) into the future after training a model with scikit-learn and pandas?


I am building a model to forecast rainfall, and the training step is already done. I am using this dataset: https://www.kaggle.com/redikod/historical-rainfall-data-in-bangladesh It looks like this:

            StationIndex  Station  Year  Month  Day  Rainfall  dayofyear
1970-01-01             1    Dhaka  1970      1    1         0          1
1970-01-02             1    Dhaka  1970      1    2         0          2
1970-01-03             1    Dhaka  1970      1    3         0          3
1970-01-04             1    Dhaka  1970      1    4         0          4
1970-01-05             1    Dhaka  1970      1    5         0          5

I trained and tested the model using code I found online as a reference, and then compared the predicted values against the true values.

Here is the code:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import tensorflow as tf

#data is in local folder
df = pd.read_csv("data.csv")
df.head(5)

df.drop(df[(df['Day']>28) & (df['Month']==2) & (df['Year']%4!=0)].index,inplace=True)
df.drop(df[(df['Day']>29) & (df['Month']==2) & (df['Year']%4==0)].index,inplace=True)
df.drop(df[(df['Day']>30) & ((df['Month']==4)|(df['Month']==6)|(df['Month']==9)|(df['Month']==11))].index,inplace=True)

date = [str(y)+'-'+str(m)+'-'+str(d) for y, m, d in zip(df.Year, df.Month, df.Day)]
df.index = pd.to_datetime(date)
df['date'] = df.index
df['dayofyear']=df['date'].dt.dayofyear
df.drop('date',axis=1,inplace=True)

df.head()
df.size  # size is an attribute, not a method; df.size() raises a TypeError
df.info()

df.plot(x='Year',y='Rainfall',style='.', figsize=(15,5))

train = df.loc[df['Year'] <= 2015]
test = df.loc[df['Year'] == 2016]
train=train[train['Station']=='Dhaka']
test=test[test['Station']=='Dhaka']

X_train=train.drop(['Station','StationIndex','dayofyear'],axis=1)
Y_train=train['Rainfall']
X_test=test.drop(['Station','StationIndex','dayofyear'],axis=1)
Y_test=test['Rainfall']

from sklearn import svm
model = svm.SVC(gamma='auto',kernel='linear')
model.fit(X_train, Y_train)

Y_pred = model.predict(X_test)

df1 = pd.DataFrame({'Actual Rainfall': Y_test, 'Predicted Rainfall': Y_pred})  
df1[df1['Predicted Rainfall']!=0].head(10)

After this I tried to use the model to predict rainfall a few days, months, or years into the future. I tried several approaches, including some written for stock prices (after adapting the code), but none of them seem to work. Since the model is already trained, I thought it would be easy to forecast a few days ahead. For example: I trained on data from 1970 to 2015 and tested on data from 2016, and now I want to predict the rainfall in 2017.

My question is, how can I do that in an intuitive way?

I'd be really grateful if someone can answer this question.

Edit @Mercury: This is the actual result after running that code. I doubt the model is running at all... Image of the actual result: https://i.sstatic.net/81Vk1.png


Solution

  • I've noticed a very simple mistake here:

    X_train=train.drop(['Station','StationIndex','dayofyear'],axis=1)
    Y_train=train['Rainfall']
    X_test=test.drop(['Station','StationIndex','dayofyear'],axis=1)
    Y_test=test['Rainfall']
    

    You haven't dropped the 'Rainfall' column from your input features, so the target is also one of the inputs.

    I'll make a bold assumption and say that you get a perfect 100% accuracy on both your training and testing sets, right? This is the reason. The model learns that whatever is in the 'Rainfall' column of the inputs is always the answer, so it copies that value during testing and gets a perfect result -- but in truth it's not predicting anything at all! For a future date there is no 'Rainfall' value to feed in, which is also why such a model cannot forecast.

    Try running like this:

    X_train=train.drop(['Station','StationIndex','dayofyear','Rainfall'],axis=1)
    Y_train=train['Rainfall']
    X_test=test.drop(['Station','StationIndex','dayofyear','Rainfall'],axis=1)
    Y_test=test['Rainfall']
    
    from sklearn import svm
    model = svm.SVC(gamma='auto',kernel='linear')
    model.fit(X_train, Y_train)
    print('Accuracy on training set: {:.2f}%'.format(100*model.score(X_train, Y_train)))
    print('Accuracy on testing set: {:.2f}%'.format(100*model.score(X_test, Y_test)))
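
On the forecasting part of the question: once 'Rainfall' is removed, the only features left are Year, Month and Day, so predicting 2017 comes down to building those three columns for the future dates and calling predict on them. Below is a minimal sketch under some loud assumptions: make_future_features is a hypothetical helper (not part of pandas or scikit-learn), the training data is a tiny synthetic stand-in rather than the Kaggle dataset, and DecisionTreeRegressor is swapped in for the SVC only so the example runs quickly. Any fitted scikit-learn estimator should be usable the same way.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor


def make_future_features(start, end):
    """Hypothetical helper: one Year/Month/Day row per calendar day in [start, end]."""
    dates = pd.date_range(start=start, end=end, freq="D")
    # Column order must match the columns the model was trained on.
    return pd.DataFrame(
        {"Year": dates.year, "Month": dates.month, "Day": dates.day},
        index=dates,
    )


# Synthetic stand-in for the Dhaka training rows (NOT the real data):
# 10 units of rain on every day from June to September, 0 otherwise.
X_train = make_future_features("2014-01-01", "2015-12-31")
Y_train = np.where(X_train["Month"].between(6, 9), 10.0, 0.0)

# Stand-in model, chosen only for speed; the SVC from above is used the same way.
model = DecisionTreeRegressor(random_state=0)
model.fit(X_train, Y_train)

# Build the feature rows for the period to forecast, then predict.
X_future = make_future_features("2017-01-01", "2017-12-31")
forecast = pd.Series(
    model.predict(X_future),
    index=X_future.index,
    name="Predicted Rainfall",
)
print(forecast.head())
```

With the real data you would fit on the X_train and Y_train from the corrected code above and pass X_future to model.predict unchanged. Keep in mind that a model whose only inputs are calendar fields can at best reproduce a typical seasonal pattern; it has no information about what is specific to 2017.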