'data.frame': 264833 obs. of 6 variables:
$ revenue : num 6 2.7 3.6 3 1.9 2.8 3.8 2.7 5.7 8.8 ...
$ pack_size : num 175 150 210 175 160 165 110 150 330 170 ...
$ life_stage: Factor w/ 7 levels "MIDAGE SINGLES/COUPLES",..: 7 7 6 6 4 1 7 7 2 7 ...
$ tier : Factor w/ 3 levels "Budget","Mainstream",..: 3 2 1 1 2 2 1 1 3 2 ...
$ month_year: Factor w/ 12 levels "Jul-2018","Aug-2018",..: 4 3 9 9 5 6 6 6 5 3 ...
$ brand : Factor w/ 29 levels "Burger","CCs",..: 14 18 9 14 29 3 11 19 7 7 ...
The above df is a sample; the original has around 2,000,000 rows. I've been using a linear regression model and ANOVA.
For example, the linear regression model:
lm_model <- lm(A ~ B * C * D * E + F, data = df)
lm_model <- lm(revenue ~ life_stage * tier * month_year* brand + pack_size, data = model_data)
The above call has been taking forever to run, so I tested it on a sample of the original df:
'data.frame': 26483 obs. of 6 variables:
$ revenue : num 6.6 8.6 9.2 9.2 11.8 7.6 7.6 7.6 8.8 6.6 ...
$ pack_size : num 175 250 150 270 380 110 110 110 170 190 ...
$ life_stage: Factor w/ 7 levels "MIDAGE SINGLES/COUPLES",..: 5 6 4 6 5 7 1 4 6 6 ...
$ tier : Factor w/ 3 levels "Budget","Mainstream",..: 1 1 2 1 1 3 2 1 3 2 ...
$ month_year: Factor w/ 12 levels "Jul-2018","Aug-2018",..: 11 1 11 11 2 2 5 11 6 10 ...
$ brand : Factor w/ 29 levels "Burger","CCs",..: 24 26 13 26 21 12 12 11 7 3 ...
Then I executed it again:
lm_model <- lm(A ~ B * C * D * E, data = df_copy)
1. set.seed(123)
2. sample_indices <- sample(nrow(model_data), size = floor(0.1 * nrow(model_data)))
3. model_data_sample <- model_data[sample_indices, ]
4. lm_model_sample <- lm(revenue ~ life_stage * tier * month_year * brand + pack_size, data = model_data_sample)
5. predictions <- predict(lm_model_sample, newdata = model_data)
6. model_data$predicted_revenue <- predictions
The 4th line of code took more than an hour to run. I don't know what's wrong; I've been stuck on this for 2 days straight. When running the 5th line, RStudio hangs indefinitely, even though the model was fit on the sampled data. The system is using up to 16 GB of RAM and up to 40 GB of virtual memory.
System configuration: 16 GB RAM at 2993 MHz, RTX 2060 GPU, AMD 2700X CPU at 4 GHz.
Your model includes all the two-way, three-way and four-way interactions between B, C, D and E (life_stage, tier, month_year and brand). With 7, 3, 12 and 29 levels respectively, that is roughly 7 * 3 * 12 * 29 ≈ 7,300 coefficients for those effects. The model matrix has n * p entries, so for 2 million rows it will have about 14.6 billion entries. At 8 bytes per entry, that is more than 100 GB for just the model matrix, which needs to be held in RAM.
Fitting the model also scales poorly with that many coefficients: the least-squares solve in lm() (a QR decomposition) costs roughly n * p^2 operations.
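You can verify the size before fitting by building the model matrix on a small row sample (the columns depend only on the factor levels, which are preserved when subsetting) and multiplying out the memory footprint. A minimal sketch, assuming model_data is your full data frame:

f <- revenue ~ life_stage * tier * month_year * brand + pack_size
# p is the same as for the full data, since the factor levels are unchanged
p <- ncol(model.matrix(f, data = model_data[sample(nrow(model_data), 1000), ]))
n <- nrow(model_data)
# dense double-precision storage: n * p entries at 8 bytes each, in GB
n * p * 8 / 1e9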
Start by simplifying your model to only consider two-way interactions:
lm(A ~ (B + C + D + E) ^ 2 + F, data = df)
which cuts the coefficient count from roughly 7,300 down to a few hundred (about 680 for these factor levels).
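Written with the actual column names from the question (a sketch, assuming model_data is the full data frame):

lm_two_way <- lm(revenue ~ (life_stage + tier + month_year + brand)^2 + pack_size,
                 data = model_data)
length(coef(lm_two_way))  # roughly 680 coefficients for these factor levels

Note that even this model matrix is around 11 GB for 2 million rows (2e6 * ~680 * 8 bytes), so fitting on a 10% sample first, as you already do, is still sensible.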