I'm building a multifactorial sales forecasting model in RStudio using XGBoost regression. I have built lag features with this function:
Create_lags <- function(MyData, start_index_lag, num_lags) {
  # Lag offsets to build (inclusive range, so num_lags + 1 columns)
  lags <- seq(from = start_index_lag, to = start_index_lag + num_lags)
  # Zero-padded column names such as lag_04, lag_05, ...
  lag_names <- paste("lag", formatC(lags, width = nchar(max(lags)), flag = "0"),
                     sep = "_")
  lag_functions <- setNames(paste("dplyr::lag(.,", lags, ")"), lag_names)
  print(lag_functions)
  # Apply every lag to Sales within each Channel/Product group
  MyData <- MyData %>%
    arrange(Channel, Product) %>%
    group_by(Channel, Product) %>%
    mutate_at(vars(Sales), funs_(lag_functions))
  print(colnames(MyData))
  return(MyData)
}
This works fine, but I have also built rolling means and standard deviations with the functions below:
Create_rolling_window_means <- function(MyData, start_index_rollfeat, num_rollfeat) {
  # Window sizes to build (inclusive range)
  rollmean_1 <- seq(from = start_index_rollfeat, to = start_index_rollfeat + num_rollfeat)
  rollmean_names <- paste("rollmean", formatC(rollmean_1,
                                              width = nchar(max(rollmean_1)), flag = "0"),
                          sep = "")
  # Rolling mean shifted by one row, so each row only sees earlier Sales values
  rollmean_functions <- setNames(paste("lag(roll_meanr(.,", rollmean_1, ")", ",1)"), rollmean_names)
  print(rollmean_functions)
  MyData <- MyData %>%
    arrange(Channel, Product) %>%
    group_by(Channel, Product) %>%
    mutate_at(vars(Sales), funs_(rollmean_functions))
  print(colnames(MyData))
  return(MyData)
}
Create_rolling_window_sd <- function(MyData, start_index_rollfeat, num_rollfeat) {
  # Window sizes to build (inclusive range)
  rollsd_1 <- seq(from = start_index_rollfeat, to = start_index_rollfeat + num_rollfeat)
  rollsd_names <- paste("rollsd", formatC(rollsd_1,
                                          width = nchar(max(rollsd_1)), flag = "0"),
                        sep = "")
  # Rolling standard deviation shifted by one row, so each row only sees earlier Sales values
  rollsd_functions <- setNames(paste("lag(roll_sdr(.,", rollsd_1, ")", ",1)"), rollsd_names)
  print(rollsd_functions)
  MyData <- MyData %>%
    arrange(Channel, Product) %>%
    group_by(Channel, Product) %>%
    mutate_at(vars(Sales), funs_(rollsd_functions))
  print(colnames(MyData))
  return(MyData)
}
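To make explicit what these rolling features compute, here is a minimal base-R sketch. It is only a stand-in: `lagged_roll_mean` is a hypothetical helper, and it assumes `roll_meanr` behaves as a right-aligned rolling mean padded with `NA`, which I mimic here with `stats::filter`. The two trailing `NA` values play the role of future days with no actual sales.

```r
# Hypothetical stand-in for lag(roll_meanr(x, n), 1), written in base R.
# Assumption: roll_meanr is a right-aligned rolling mean padded with NA.
lagged_roll_mean <- function(x, n) {
  # Right-aligned rolling mean; NA wherever the window is incomplete or contains NA
  rm <- as.numeric(stats::filter(x, rep(1 / n, n), sides = 1))
  # Shift by one row so each row only uses values from earlier rows
  c(NA_real_, head(rm, -1))
}

# Five observed days followed by two future days with unknown sales
sales <- c(1, 2, 3, 4, 5, NA, NA)
feat <- lagged_roll_mean(sales, n = 3)
print(feat)
```

With a window of 3, the first future row still gets a value (the mean of the last three observed days), but the second future row is `NA` because its window already contains an unknown sales value.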
This works fine for just one future data point, but I'm in the situation below (Excel example, rolling mean over 3 periods).
As it stands, I can predict only one future data point. What I think I need is to change the function so that, in a loop, it uses the predicted rolling mean as historic data whenever the actual historic data point is missing, in order to fill 45 future data points (45 days), something like the example below.
My final result should be a single column filled with the values from the last column (the same applies to the standard deviation), which I can then use as a variable in my model. For additional context, I'm using these values:
start_index_lag=4
num_lags=60
start_index_rollfeat=4
num_rollfeat=60
forecast_horizon = 45 #45 days
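A minimal sketch of the recursive fill I have in mind, on a plain vector and a window of 3 as in the Excel example. `recursive_roll_mean` is a hypothetical helper, not part of my pipeline, and it assumes that each filled value simply stands in for the missing sales figure on that day:

```r
# Hypothetical helper: extend a series by `horizon` steps, where each new value
# is the mean of the previous `window` values (actual or already filled).
recursive_roll_mean <- function(x, window, horizon) {
  stopifnot(length(x) >= window, window >= 1, horizon >= 0)
  n <- length(x)
  out <- c(x, rep(NA_real_, horizon))
  for (k in seq_len(horizon)) {
    # The window ends at the step just before the one being filled
    idx <- (n + k - window):(n + k - 1)
    out[n + k] <- mean(out[idx])
  }
  out
}

# Three observed days, then three future days filled recursively
sales <- c(10, 20, 30)
filled <- recursive_roll_mean(sales, window = 3, horizon = 3)
print(filled)
```

In the real data this would have to run per Channel/Product group and for every window size, and the standard deviation would need its own rule for what replaces the missing sales value.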
I managed to make it recursive by looping over Dates[i]:
Dates <- seq(max(train$Date), by = "day", length.out = 45)
Dates
i <- 1
for (i in 2:length(Dates)) {
  df_test <- MyDataTotal %>%
    filter(Date <= Dates[i]) %>%
    group_by(Channel, Product)
  # %>%
  # filter(n() > 13)  # to avoid items that do not have enough rows
  # build the feature engineering for the unseen weeks
  test_1 <- df_test %>%
    Create_AR_MA_feats(., start_index_lag, num_lags)
  # , start_index_rollfeat, num_rollfeat)
  # filter the unseen day features
  test_Final <- test_1[test_1$Date == Dates[i], ]