Tags: python, machine-learning, google-colaboratory, forecasting

Time series forecasting of visit dates with customer classes: graph not accurate


I am trying to do time series forecasting on a set of classes and timestamps, but for some reason my graph does not look right (see the result further down). My full code is below:

from google.colab import drive
drive.mount('/content/gdrive', force_remount = True)

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt

data = pd.read_csv('gdrive/My Drive/Colab_Notebooks/classproject/classdata.csv', parse_dates=['time_date'], index_col='time_date')
# Keep the calendar date of each visit as its own column
data['date'] = data.index.date

class_id = data['class_id']
time_date = data.index.to_series()
m1 = class_id.ne(class_id.shift())
m2 = time_date.dt.date.ne(time_date.dt.date.shift())
data['count'] = data.groupby((m1 | m2).cumsum()).cumcount().add(1).values

out = data[data.groupby(data.index.date).transform('size').gt(1)]

!pip install pandas-datareader

import pandas_datareader.data as web
import datetime

# pandas and matplotlib are already imported above
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import seaborn as sns

sns.set()

plt.ylabel('Amount of classes')
plt.xlabel('Date')
plt.xticks(rotation=45)

out.index = pd.to_datetime(out['date'], format='%Y-%m-%d')
plt.plot(out.index, out['count'])

results of plt

The blog where I got this time series code shows this kind of result instead:

enter image description here

So I'm not sure if I should proceed or not XD

My input data looks like this:
timestamp / class_id
2021-09-27 06:00:00 / A
2021-09-27 03:00:00 / A
2021-09-27 01:00:00 / A
2021-09-27 08:29:00 / C
2021-05-23 08:08:49 / B
2021-05-23 03:21:49 / B
2021-05-23 01:22:11 / C

After processing it and adding the count and date columns:
count / timestamp / class_id / date
1 / 2021-09-27 06:00:00 / A / 2021-09-27
2 / 2021-09-27 03:00:00 / A / 2021-09-27
3 / 2021-09-27 01:00:00 / A / 2021-09-27
1 / 2021-09-27 08:29:00 / C / 2021-09-27
1 / 2021-05-23 08:08:49 / B / 2021-05-23
2 / 2021-05-23 03:21:49 / B / 2021-05-23
1 / 2021-05-23 01:22:11 / C / 2021-05-23
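
To make the processing step concrete, here is a minimal sketch that rebuilds the table above from the seven sample rows. It assumes the rows are kept in the order shown and uses plain columns instead of a datetime index; the names `df`, `new_class`, `new_date` and `run_id` are only for illustration. Note that `count` is a running counter that restarts whenever the class or the date changes, so it is not a per-day total, which may be part of why a single line drawn through it zigzags.

```python
import pandas as pd

# The seven sample rows from the question, in the same order.
rows = [
    ("2021-09-27 06:00:00", "A"),
    ("2021-09-27 03:00:00", "A"),
    ("2021-09-27 01:00:00", "A"),
    ("2021-09-27 08:29:00", "C"),
    ("2021-05-23 08:08:49", "B"),
    ("2021-05-23 03:21:49", "B"),
    ("2021-05-23 01:22:11", "C"),
]
df = pd.DataFrame(rows, columns=["timestamp", "class_id"])
df["timestamp"] = pd.to_datetime(df["timestamp"])
df["date"] = df["timestamp"].dt.date

# A new run starts whenever the class or the calendar date differs
# from the previous row.
new_class = df["class_id"].ne(df["class_id"].shift())
new_date = df["date"].ne(df["date"].shift())
run_id = (new_class | new_date).cumsum()

# Number the rows inside each run, starting at 1.
df["count"] = df.groupby(run_id).cumcount() + 1

print(df[["count", "timestamp", "class_id", "date"]])
```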

I tried the code below, but for some reason the first graph is empty:

plt.ylabel('Amount of classes')
plt.xlabel('date')
plt.xticks(rotation=45)

out.index = pd.to_datetime(out['date'], format='%Y-%m-%d')
out.groupby('class_id').plot()
plt.plot(out.index, out['count'])

enter image description here


Solution

  • You are plotting all of your class_id values together as a single line. Try plotting per class instead, for example with out.groupby('class_id').plot(), to check whether each class's plot makes sense and looks the way you expect.
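
As a rough sketch of the per-class idea, the snippet below draws one line per class on a single set of axes. It is not the asker's exact pipeline: the `out` frame is rebuilt by hand from the sample rows in the question, and it assumes that the quantity of interest is the number of rows per class per day (computed with `size()`), rather than the running `count` column. The name `daily` is only illustrative.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hand-built stand-in for `out`, using the sample rows from the question.
out = pd.DataFrame({
    "class_id": ["A", "A", "A", "C", "B", "B", "C"],
    "date": pd.to_datetime(["2021-09-27"] * 4 + ["2021-05-23"] * 3),
})

# Assumption: what should be plotted is the number of rows per class per day.
# Rows are dates, columns are classes, missing combinations become 0.
daily = out.groupby(["date", "class_id"]).size().unstack("class_id", fill_value=0)

# Create the axes first, then draw every class on that same axes object.
fig, ax = plt.subplots()
for class_id, series in daily.items():
    ax.plot(series.index, series.values, marker="o", label=class_id)

ax.set_xlabel("Date")
ax.set_ylabel("Amount of classes")
ax.legend(title="class_id")
fig.autofmt_xdate(rotation=45)
```

In a notebook the figure is displayed when the cell finishes; elsewhere, call plt.show(). Drawing on one explicit `ax` also avoids the empty first graph from the question, which was probably the figure opened by the plt.ylabel/plt.xlabel calls before groupby('class_id').plot() created a separate figure for each class.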