Sales prediction and effect size regressors with Prophet or Lineair Regression

I am using Prophet and Linear Regression in order to:

Predict sales on day level / store level;
Understand the effect size of my regressors (x variables).

I don’t necessarily want to stick to these modelling techniques.

Now I’m facing the issue that if I model each store separately, the number of observations will decrease (and hence I am losing degrees of freedom). However, if I aggregate all stores - and model them at once – I expect that the model will not fit very well. Furthermore, if I aggregate the sales of these stores, the biggest stores will have a heavier weighting factor in the group.

Eventually, I need a prediction on store level. However, I would like to use all stores to determine the effect size of my external repressors.

My data is 4 years of sales data on day level from 100 stores. The additional regressor is the depth of discount (in percentages). See below an example of how my data looks like:

> head(data)
        Date Sales_EUR Store_ID Discount_depth
1 2017-01-01       101        1           0.10
2 2017-01-01       105        2           0.12
3 2017-01-01       104        3           0.11
4 2017-01-01       200        4           0.09
5 2017-01-01       170        5           0.10
6 2017-01-01       150        6           0.12

Does anyone have a solution or best practice for this issue?

Many thanks in advance.

Solution

I'm not familiar with Prophet but I think, based on what you've said, it sounds like you want to perform a linear mixed-effects (random-effects) model which can account for differences between and within stores. This will help with overall sales prediction but also individual store sales. I've created some sample data based on yours and provided a very basic RE model with Store_id as a RE.

library(dplyr) 
library(zoo) #create season variable
library(nlme) #random effects

set.seed(10)
df<-data.frame(Date = rep(seq.Date(from =as.Date("01/01/2016", "%d/%m/%Y"), 
                               to=as.Date("01/01/2020", "%d/%m/%Y"), by="day"), times = 100), 
               Sales_EUR = rnorm(146200, 150, 25),
               Store_ID = rep(1:100, each = 1462),
               Discount_depth = rnorm(146200, 0.10, 0.01))
df <- df %>% 
  dplyr::arrange(Store_ID, Date)

#create season variable to try to capture seasonality, month as factor might suffice?
yq <- as.yearqtr(as.yearmon(df$Date, "%d/%m/%Y") + 1/12)
df$Season <- factor(format(yq, "%q"), levels = 1:4, 
                    labels = c("winter", "spring", "summer", "fall"))
head(df)  
        Date Sales_EUR Store_ID Discount_depth Season
1 2016-01-01  150.4687        1     0.08615730 winter
2 2016-01-02  145.3937        1     0.10361614 winter
3 2016-01-03  115.7167        1     0.09962170 winter
4 2016-01-04  135.0208        1     0.08624449 winter
5 2016-01-05  157.3636        1     0.10553382 winter
6 2016-01-06  159.7449        1     0.08965313 winter

From this dataset, I then ran a simple RE model:

#i presume from your question you want to predict "Sales_EUR"?
#basic random-effects model using library(nlme)
m1 <- lme(Sales_EUR ~  Discount_depth + Season,
          random = ~ 1 | Store_ID,
          data = df,
          na.action = "na.omit") 
summary(m1)

Linear mixed-effects model fit by REML
 Data: df 
      AIC     BIC    logLik
  1355776 1355845 -677880.8

Random effects:
 Formula: ~1 | Store_ID
        (Intercept) Residual
StdDev:   0.2288529 24.97051

Fixed effects: Sales_EUR ~ Discount_depth + Season 
                   Value Std.Error     DF   t-value p-value
(Intercept)    151.22885  0.666889 146096 226.76767  0.0000
Discount_depth -13.77271  6.532148 146096  -2.10845  0.0350
Seasonspring     0.02474  0.184847 146096   0.13382  0.8935
Seasonsummer    -0.01271  0.184847 146096  -0.06875  0.9452
Seasonfall       0.20315  0.185349 146096   1.09603  0.2731
 Correlation: 
               (Intr) Dscnt_ Ssnspr Ssnsmm
Discount_depth -0.980                     
Seasonspring   -0.143  0.003              
Seasonsummer   -0.142  0.002  0.504       
Seasonfall     -0.140  0.001  0.503  0.503

Standardized Within-Group Residuals:
         Min           Q1          Med           Q3          Max 
-4.530496999 -0.674513460  0.000275551  0.676791346  4.162294311 

Number of Observations: 146200
Number of Groups: 100

You will want to play around with the model but this is the basic idea to get you going. From this you can then start to try and make predictions, see here for help.