I am using Prophet and Linear Regression in order to:
I don’t necessarily want to stick to these modelling techniques.
Now I’m facing the issue that if I model each store separately, the number of observations will decrease (and hence I am losing degrees of freedom). However, if I aggregate all stores - and model them at once – I expect that the model will not fit very well. Furthermore, if I aggregate the sales of these stores, the biggest stores will have a heavier weighting factor in the group.
Eventually, I need a prediction on store level. However, I would like to use all stores to determine the effect size of my external repressors.
My data is 4 years of sales data on day level from 100 stores. The additional regressor is the depth of discount (in percentages). See below an example of how my data looks like:
> head(data)
Date Sales_EUR Store_ID Discount_depth
1 2017-01-01 101 1 0.10
2 2017-01-01 105 2 0.12
3 2017-01-01 104 3 0.11
4 2017-01-01 200 4 0.09
5 2017-01-01 170 5 0.10
6 2017-01-01 150 6 0.12
Does anyone have a solution or best practice for this issue?
Many thanks in advance.
I'm not familiar with Prophet but I think, based on what you've said, it sounds like you want to perform a linear mixed-effects (random-effects) model which can account for differences between and within stores. This will help with overall sales prediction but also individual store sales. I've created some sample data based on yours and provided a very basic RE model with Store_id
as a RE.
library(dplyr)
library(zoo) #create season variable
library(nlme) #random effects
set.seed(10)
df<-data.frame(Date = rep(seq.Date(from =as.Date("01/01/2016", "%d/%m/%Y"),
to=as.Date("01/01/2020", "%d/%m/%Y"), by="day"), times = 100),
Sales_EUR = rnorm(146200, 150, 25),
Store_ID = rep(1:100, each = 1462),
Discount_depth = rnorm(146200, 0.10, 0.01))
df <- df %>%
dplyr::arrange(Store_ID, Date)
#create season variable to try to capture seasonality, month as factor might suffice?
yq <- as.yearqtr(as.yearmon(df$Date, "%d/%m/%Y") + 1/12)
df$Season <- factor(format(yq, "%q"), levels = 1:4,
labels = c("winter", "spring", "summer", "fall"))
head(df)
Date Sales_EUR Store_ID Discount_depth Season
1 2016-01-01 150.4687 1 0.08615730 winter
2 2016-01-02 145.3937 1 0.10361614 winter
3 2016-01-03 115.7167 1 0.09962170 winter
4 2016-01-04 135.0208 1 0.08624449 winter
5 2016-01-05 157.3636 1 0.10553382 winter
6 2016-01-06 159.7449 1 0.08965313 winter
From this dataset, I then ran a simple RE model:
#i presume from your question you want to predict "Sales_EUR"?
#basic random-effects model using library(nlme)
m1 <- lme(Sales_EUR ~ Discount_depth + Season,
random = ~ 1 | Store_ID,
data = df,
na.action = "na.omit")
summary(m1)
Linear mixed-effects model fit by REML
Data: df
AIC BIC logLik
1355776 1355845 -677880.8
Random effects:
Formula: ~1 | Store_ID
(Intercept) Residual
StdDev: 0.2288529 24.97051
Fixed effects: Sales_EUR ~ Discount_depth + Season
Value Std.Error DF t-value p-value
(Intercept) 151.22885 0.666889 146096 226.76767 0.0000
Discount_depth -13.77271 6.532148 146096 -2.10845 0.0350
Seasonspring 0.02474 0.184847 146096 0.13382 0.8935
Seasonsummer -0.01271 0.184847 146096 -0.06875 0.9452
Seasonfall 0.20315 0.185349 146096 1.09603 0.2731
Correlation:
(Intr) Dscnt_ Ssnspr Ssnsmm
Discount_depth -0.980
Seasonspring -0.143 0.003
Seasonsummer -0.142 0.002 0.504
Seasonfall -0.140 0.001 0.503 0.503
Standardized Within-Group Residuals:
Min Q1 Med Q3 Max
-4.530496999 -0.674513460 0.000275551 0.676791346 4.162294311
Number of Observations: 146200
Number of Groups: 100
You will want to play around with the model but this is the basic idea to get you going. From this you can then start to try and make predictions, see here for help.