Search code examples
rforecastinglmfacebook-prophet

Sales prediction and effect size regressors with Prophet or Lineair Regression


I am using Prophet and Linear Regression in order to:

  1. Predict sales on day level / store level;
  2. Understand the effect size of my regressors (x variables).

I don’t necessarily want to stick to these modelling techniques.

Now I’m facing the issue that if I model each store separately, the number of observations will decrease (and hence I am losing degrees of freedom). However, if I aggregate all stores - and model them at once – I expect that the model will not fit very well. Furthermore, if I aggregate the sales of these stores, the biggest stores will have a heavier weighting factor in the group.

Eventually, I need a prediction on store level. However, I would like to use all stores to determine the effect size of my external repressors.

My data is 4 years of sales data on day level from 100 stores. The additional regressor is the depth of discount (in percentages). See below an example of how my data looks like:

> head(data)
        Date Sales_EUR Store_ID Discount_depth
1 2017-01-01       101        1           0.10
2 2017-01-01       105        2           0.12
3 2017-01-01       104        3           0.11
4 2017-01-01       200        4           0.09
5 2017-01-01       170        5           0.10
6 2017-01-01       150        6           0.12

Does anyone have a solution or best practice for this issue?

Many thanks in advance.


Solution

  • I'm not familiar with Prophet but I think, based on what you've said, it sounds like you want to perform a linear mixed-effects (random-effects) model which can account for differences between and within stores. This will help with overall sales prediction but also individual store sales. I've created some sample data based on yours and provided a very basic RE model with Store_id as a RE.

    library(dplyr) 
    library(zoo) #create season variable
    library(nlme) #random effects
    
    set.seed(10)
    df<-data.frame(Date = rep(seq.Date(from =as.Date("01/01/2016", "%d/%m/%Y"), 
                                   to=as.Date("01/01/2020", "%d/%m/%Y"), by="day"), times = 100), 
                   Sales_EUR = rnorm(146200, 150, 25),
                   Store_ID = rep(1:100, each = 1462),
                   Discount_depth = rnorm(146200, 0.10, 0.01))
    df <- df %>% 
      dplyr::arrange(Store_ID, Date)
    
    #create season variable to try to capture seasonality, month as factor might suffice?
    yq <- as.yearqtr(as.yearmon(df$Date, "%d/%m/%Y") + 1/12)
    df$Season <- factor(format(yq, "%q"), levels = 1:4, 
                        labels = c("winter", "spring", "summer", "fall"))
    head(df)  
            Date Sales_EUR Store_ID Discount_depth Season
    1 2016-01-01  150.4687        1     0.08615730 winter
    2 2016-01-02  145.3937        1     0.10361614 winter
    3 2016-01-03  115.7167        1     0.09962170 winter
    4 2016-01-04  135.0208        1     0.08624449 winter
    5 2016-01-05  157.3636        1     0.10553382 winter
    6 2016-01-06  159.7449        1     0.08965313 winter
    

    From this dataset, I then ran a simple RE model:

    #i presume from your question you want to predict "Sales_EUR"?
    #basic random-effects model using library(nlme)
    m1 <- lme(Sales_EUR ~  Discount_depth + Season,
              random = ~ 1 | Store_ID,
              data = df,
              na.action = "na.omit") 
    summary(m1)
    
    Linear mixed-effects model fit by REML
     Data: df 
          AIC     BIC    logLik
      1355776 1355845 -677880.8
    
    Random effects:
     Formula: ~1 | Store_ID
            (Intercept) Residual
    StdDev:   0.2288529 24.97051
    
    Fixed effects: Sales_EUR ~ Discount_depth + Season 
                       Value Std.Error     DF   t-value p-value
    (Intercept)    151.22885  0.666889 146096 226.76767  0.0000
    Discount_depth -13.77271  6.532148 146096  -2.10845  0.0350
    Seasonspring     0.02474  0.184847 146096   0.13382  0.8935
    Seasonsummer    -0.01271  0.184847 146096  -0.06875  0.9452
    Seasonfall       0.20315  0.185349 146096   1.09603  0.2731
     Correlation: 
                   (Intr) Dscnt_ Ssnspr Ssnsmm
    Discount_depth -0.980                     
    Seasonspring   -0.143  0.003              
    Seasonsummer   -0.142  0.002  0.504       
    Seasonfall     -0.140  0.001  0.503  0.503
    
    Standardized Within-Group Residuals:
             Min           Q1          Med           Q3          Max 
    -4.530496999 -0.674513460  0.000275551  0.676791346  4.162294311 
    
    Number of Observations: 146200
    Number of Groups: 100 
    

    You will want to play around with the model but this is the basic idea to get you going. From this you can then start to try and make predictions, see here for help.