Tags: r, data.table, lm

Faster way to run a regression on large data


I have a large dataset with 70k+ rows and several columns of variable data. I also have a column with over 5,000 factor levels that I need to use.

Is there any way to speed up the regression? It currently takes over 40 minutes to run. The only ways I can think of to speed it up would be to filter the training data down to only the factor levels that appear in the test data, or to use a data.table and run the regression from that.

Any help would be greatly appreciated.

library(dbplyr)
library(dplyr)
library(data.table)
library(readr)
library(readxl)   # needed for read_excel()


greys <- read_excel("Punting/Dogs/greys.xlsx", sheet = "Vic")
greys$name <- as.factor(greys$name)

## keep the last 63,000 rows; hold out the final 190 rows as a test set
ggtrain <- tail(greys, 63000)
gtrain  <- head(ggtrain, -190)
gtest1  <- tail(ggtrain, 190)
gtest   <- filter(gtest1, runnum > 5)

#mygrey <- gam(time ~ s(name, bs = 'fs') + s(box) + s(distance),
#              data = gtrain, method = 'ML')
mygrey <- lm(margin ~ name + box + distance + trate + grade + trackid,
             data = gtrain)
pgrey  <- predict(mygrey, gtest)
gdf    <- data.frame(gtest$name, pgrey)
#gdf
write.csv(gdf, 'thedogs.csv')

Classes ‘tbl_df’, ‘tbl’ and 'data.frame':   63000 obs. of  25 variables:
 $ position: num  4 5 6 7 1 2 3 4 5 6 ...
 $ box     : num  3 10 5 8 3 7 9 5 2 4 ...
 $ name    : Factor w/ 5903 levels "AARON'S ME BOY",..: 4107 2197 3294 3402 4766 4463 5477 274 5506 2249 ...
 $ trainer : chr  "Marcus Lloyd" "Ian Robinson" "Adam Richardson" "Nathan  Hunt" ...
 $ time    : num  22.9 23 23.1 23.5 22.5 ...
 $ margin  : num  7.25 8.31 9.96 15.33 0 ...
 $ split   : num  9.17 8.98 9.12 9.14 8.62 8.73 8.8 8.99 9.04 9.02 ...
 $ inrun   : num  75 44 56 67 11 22 33 54 76 67 ...
 $ weight  : num  27.9 26.2 30.3 27.7 26.5 31.5 34.1 32.8 31.2 34 ...
 $ sire    : chr  "Didda Joe" "Swift Fancy" "Barcia Bale" "Hostile" ...
 $ dam     : chr  "Hurricane Queen" "Ulla Allen" "Diva's Shadow" "Flashing Bessy" ...
 $ odds    : num  20.3 55.5 1.6 33.2 1.6 5 22.6 7.9 12.5 9.9 ...
 $ distance: num  390 390 390 390 390 390 390 390 390 390 ...
 $ grade   : num  4.5 4.5 4.5 4.5 4.5 4.5 4.5 4.5 4.5 4.5 ...
 $ race    : chr  "Race 11" "Race 11" "Race 11" "Race 11" ...
 $ location: chr  "Ballarat" "Ballarat" "Ballarat" "Ballarat" ...
 $ date    : chr  "Monday 5th of August 2019" "Monday 5th of August 2019" "Monday 5th of August 2019" "Monday 5th of August 2019" ...
 $ state   : chr  "vic" "vic" "vic" "vic" ...
 $ trate   : num  0.515 0.376 0.818 0.226 0.55 ...
 $ espeed  : num  75 44 56 67 11 22 33 54 76 67 ...
 $ trackid : num  3 3 3 3 3 3 3 3 3 3 ...
 $ runnum  : num  4 6 3 2 2 2 3 4 2 4 ...
 $ qms     : chr  "M/75" "M/44" "M/56" "M/67" ...



Solution

  • Your regression fits slowly because of the name variable. A factor with 5,903 levels adds 5,902 dummy columns to your design matrix (one level is absorbed into the intercept) - it is like trying to fit thousands of separate variables.

    Your design matrix will have dimensions of roughly 63,000 x 5,908, which will (1) take up a lot of memory and (2) make lm work hard to produce its estimates (hence the 40-minute fitting time).
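    You can see this expansion directly with model.matrix(). Here is a toy sketch using a made-up 4-level factor (materialising the full 63,000 x 5,908 matrix is exactly what you want to avoid):

    ## toy data: a 4-level factor plus two numeric predictors
    toy <- data.frame(
      name     = factor(c("A", "B", "C", "D", "A", "B")),
      box      = c(1, 2, 3, 4, 5, 6),
      distance = c(390, 390, 450, 450, 390, 450)
    )
    ## the 4-level factor expands to 3 dummy columns (one level is absorbed
    ## into the intercept), so the matrix is 6 rows x 6 columns
    dim(model.matrix(~ name + box + distance, data = toy))
    #> [1] 6 6
    ## with 5,903 dog names the same expansion gives roughly 5,908 columns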

    You have a few options:

    1. Keep your design as is and wait (or find a slightly faster implementation of lm)
    2. Throw out the name variable, in which case lm will fit almost instantly (see the sketch after this list)
    3. Fit a mixed-effects model with name as a random effect, using lmer from the lme4 package (or a similar package). lmer in particular uses a sparse design matrix for the random effects, taking advantage of the fact that each observation can have only one of the 5,903 names (so most of the matrix is empty).
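
    For comparison, option 2 is a one-line change to your existing call (a minimal sketch, reusing the gtrain data frame built in your question):

    ## option 2: same model without the 5,903-level name factor;
    ## the design matrix shrinks to about 63,000 x 6 and lm fits in seconds
    mygrey_nonames <- lm(margin ~ box + distance + trate + grade + trackid,
                         data = gtrain)
    summary(mygrey_nonames)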

    Of the three, the third option is likely the most principled way forward. A random effect will account for individual-level variation across observations, and it will also pool information across individuals, giving better estimates for dogs that don't have many observations. On top of that, it will compute quickly thanks to the sparse design matrix.

    A simple model for your dataset might look something like this:

    library(lme4)
    ## gtrain prepared as in the question above
    mygrey <- lmer(margin ~ (1 | name) + box + distance + trate + grade + trackid,
                   data = gtrain)
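
    If you then want predictions on gtest as in your original script, note that predict() on an lmer fit will error on dogs that appear only in the test set unless you set allow.new.levels = TRUE, which falls back to the fixed-effects-only (population-level) prediction for unseen names. A sketch, assuming gtest is built as in your question:

    ## predict the held-out races; unseen dogs get the population-level prediction
    pgrey <- predict(mygrey, newdata = gtest, allow.new.levels = TRUE)
    gdf   <- data.frame(name = gtest$name, predicted_margin = pgrey)
    write.csv(gdf, "thedogs.csv", row.names = FALSE)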
    

    If you want to go that route, I recommend reading more about mixed effects models so that you can choose the model structure that makes sense for your data. Here are two good resources: