Search code examples
pythonpandasforecasting

Converting a weekly forecast (Pandas df) into monthly format


I have a process which produces a dataframe containing a forecasting for products (and versions) in a weekly format (wc/ Monday dates - column names as strings). Example:

product     version     2021-06-07     2021-06-14     2021-06-21     2021-06-28

   a           1           500            400            300            200

   a           2           750            600            450            200

   b           1           200            150            100            100

   b           2           500            400            300            200

I've been asked to change the forecast into a monthly forecast instead of a weekly one. Example:

product     version       Jun-21         Jul-21         Aug-21         Sep-21

   a           1           350             x              x              x

   a           2           500             x              x              x
 
   b           1           100             x              x              x

   b           2           350             x              x              x

Numbers are for show - what I'm trying to do is average the weekly columns (for each row) to create monthly outputs but in an accurate fashion, i.e. if a weekly column was wc/ 26th Feb, then only 3 days worth will be included in the average for February and only 4 days for March.

I know that this is only a question of formatting / bucketing but I am struggling to come up with a solution as I've never had to do something like this before.

Not expecting a full solution but a point in the right direction for how I should approach the task would be much appreciated.


Solution

  • It's a bit of a process, since you need to calculate days in the month, identify which ones flow into the next month, do the math and shift them forward. This should do the trick.

    import pandas as pd
    import numpy as np
    
    df = pd.DataFrame({'product': ['a', 'a', 'b', 'b'],
     'version': [1, 2, 1, 2],
     '6/7/2021': [500, 750, 200, 500],
     '6/14/2021': [400, 600, 150, 400],
     '6/21/2021': [300, 450, 100, 300],
     '6/28/2021': [200, 200, 100, 200],
     })
    
    # Convert data to long format
    df = df.melt(id_vars=['product','version'], var_name='date')
    # Convert date to datetime object
    df['date'] = pd.to_datetime(df['date'])
    
    # Add 7 days to the day of the month to compare to the number of days in a month
    df['month_day'] = df['date'].dt.day + 7
    
    # Get the number of days in the month
    df['days_in_month'] = df['date'].dt.daysinmonth
    
    # Subtract to see how many days the current date would extend into the next month
    df['overrun'] = df['month_day']-df['days_in_month']
    
    # Calculate the percentage of the values to push forward into the next month
    df['push_forward'] = np.where(df['overrun']>0, df['value']/df['days_in_month']*df['overrun'], 0)
    
    # Reduce the current values by the amount to be pushed forward
    df['value'] = df['value'] - df['push_forward']
    
    # Copy the records with a push_forward value to a new dataframe
    df2 = df.loc[df['push_forward']>0].copy()
    
    # Drop push_foward column
    df.drop(columns='push_forward', inplace=True)
    
    # Add a week to the date values of records with a push_foward value
    df2['date'] = df2['date']+pd.DateOffset(weeks=1)
    
    # Merge the pushed data back to the original dataframe
    df = df.merge(df2[['product','version','date','push_forward']], on=['product','version','date'], how='outer')
    
    # Fill null values
    df.fillna(0, inplace=True)
    
    # Add the push forward values to their respective weekly values
    df['value'] = df['value'] + df['push_forward']
    
    # Convert date to just the month
    df['date'] = df['date'].dt.strftime('%Y-%m')
    
    # Group and take the average
    df = df.groupby(['product','version','date'])['value'].mean().reset_index()
    
    
    # # Create final pivot table
    df.pivot_table(index=['product','version'], columns='date', values='value')
    

    Output

                date       2021-06    2021-07
    product version     
          a        1    341.666667  33.333333
                   2    491.666667  33.333333
          b        1    133.333333  16.666667
                   2    341.666667  33.333333