python time-series dataset forecasting

Ideal Dataset Structure for Time Series Forecasting


I'm trying to do time series forecasting in Python.

Before I start, I have some questions about how to prepare the source dataset.

I just want to understand how the data should be structured.

Let's say I have several departments, each with multiple teams, and I want to forecast total sales for each department.

I can prepare the data in either of the following ways:

[Image: the two candidate layouts, Option 1 and Option 2]

Most of the tutorials I have seen online use Option 2, but I prefer Option 1.

If new departments are added in the future, Option 1 lets me add them as rows, whereas with Option-2 I would need to add more and more columns each time.

My questions are:

  1. Can I use the structure in Option-1 for preparing my dataset?

  2. If yes: in the Date column, 1st June appears in 3 records, one for each team in a department. Is there any requirement that a date should appear in only one row?

  3. In Option-1, let's say I want to predict total sales by department. Will adding an additional column like Team Name have any impact when building models for time series forecasting?

I would be really glad if someone could help. Thanks in advance.


Solution

  • How you prepare your data for a forecast depends on the question you are trying to answer (don't get me wrong, I'm not saying you should manipulate your preparation to get the answers you want). You say "I want to time series forecasting on Total Sales By each department", which implies you don't care about the teams within a department. In that case Option-1 is not ideal, because you first have to compute each department's total sales instead of simply reading the value you need.

    However, it is very common to keep source data at a more detailed level than the one at which you will use it. The key takeaway is that you are going to read this data with Python. Aggregating the data to the level you need should be done in Python, and it is absolutely fine to store it in more detail in, for example, a .csv file.
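As a minimal sketch of that idea: the column names `date`, `department`, `team`, and `sales` and all the numbers below are made up for illustration (the original layout was only shown as an image). Under those assumptions, the detailed file can be read and summed up to department level with pandas:

```python
import io

import pandas as pd

# Stand-in for a detailed .csv file on disk.
# Column names and values are hypothetical, not taken from the question.
csv_text = """date,department,team,sales
2021-06-01,A,T1,100
2021-06-01,A,T2,150
2021-06-01,B,T3,200
2021-06-02,A,T1,110
2021-06-02,A,T2,160
2021-06-02,B,T3,210
"""

# Read the team-level rows and parse the date column as datetimes.
detailed = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"])

# Sum the team-level rows into one row per (date, department).
dept_sales = detailed.groupby(["date", "department"], as_index=False)["sales"].sum()

print(dept_sales)
```

With a real file you would pass its path to `pd.read_csv` instead of the `io.StringIO` stand-in.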

    To answer your questions:

    1. Yes, you can definitely use Option-1 to store your data; it would also be my preferred way.
    2. There is no limit on how many columns, rows, or duplicates you can have in your data. In fact, the more detailed your data is (more columns), the more rows will probably share the same date value.
    3. If you only intend to distinguish by department and not by team, you can, for example, use the pandas library to aggregate your data by department after you read it. There is no use in keeping the detailed team information at that point.
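To illustrate points 2 and 3, here is a hedged sketch with invented data and hypothetical column names (`date`, `department`, `team`, `sales`). Dates repeat in the stored rows, but after aggregating with `pivot_table` each date appears exactly once, with one column per department. That is roughly the one-column-per-series layout many tutorials start from.

```python
import pandas as pd

# Team-level rows: each date appears several times (values are made up).
long_df = pd.DataFrame(
    {
        "date": pd.to_datetime(["2021-06-01"] * 3 + ["2021-06-02"] * 3),
        "department": ["A", "A", "B", "A", "A", "B"],
        "team": ["T1", "T2", "T3", "T1", "T2", "T3"],
        "sales": [120, 80, 300, 130, 90, 310],
    }
)

# Drop the team dimension by summing sales per date and department.
# The result has one row per date and one column per department.
wide = long_df.pivot_table(
    index="date", columns="department", values="sales", aggfunc="sum"
)

# Each column is now a single time series with a unique date index,
# e.g. the series for the hypothetical department "A".
series_a = wide["A"]

print(wide)
```

Storing the data in the detailed layout and reshaping it on read like this means a new department is just more rows in the file; the extra column only appears in the reshaped result.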

    You have good questions, but it is difficult to give a clear and complete answer to all of them. My advice would be to get some kind of result as quickly as possible while being clear about the choices you make along the way. Once you have a result, you can fine-tune and revisit earlier decisions. No forecasting model is ever perfect or done in one try.