Tags: r, dataframe, matrix, logistic-regression, rstan

Will rstan fit models faster with data stored as matrix or data.frame?


I am fitting a series of multilevel logistic regressions using rstan via the map2stan function in the rethinking library. Everything works fine, and the models all fit correctly and converge. However, the datasets I am working with are very large, so runtimes to fit each model are quite long (on the order of days). Consequently, I'm looking for any potential speedups I can find.

Right now, my data is stored in data.frames that all have similar structure with scaled continuous variables and categorical variables split up into 0/1 dummies. For example:

> str(dcc.s.dummy)
'data.frame':   85604 obs. of  34 variables:
 $ COST_DIST_ECOTONE        : num  -0.594 -0.593 -0.596 -0.591 -0.591 ...
 $ COST_DIST_HEA            : num  -0.663 -0.66 -0.672 -0.652 -0.65 ...
 $ COST_DIST_HISTOSOLS      : num  -2.09 -2.09 -2.09 -2.09 -2.09 ...
 $ COST_DIST_MEDSTR         : num  -0.178 -0.176 -0.177 -0.176 -0.174 ...
 $ COST_DIST_RIV_COAST      : num  0.34 0.337 0.335 0.341 0.338 ...
 $ DEM30_ASP_RE_2           : num  0 0 0 0 0 1 0 0 0 0 ...
 $ DEM30_ASP_RE_3           : num  0 0 0 0 0 0 0 0 0 0 ...
 $ DEM30_ASP_RE_4           : num  1 0 0 1 0 0 0 0 0 1 ...
 $ DEM30_ASP_RE_5           : num  0 1 0 0 1 0 0 0 0 0 ...
 $ DEM30_M                  : num  2.19 2.19 2.2 2.18 2.19 ...
 $ DEM30_SLOPE              : num  -0.797 -0.782 -0.839 -0.817 -0.76 ...
 $ DRIFT_THICK_1            : num  0 0 0 0 0 0 0 0 0 0 ...
 $ DRIFT_THICK_2            : num  0 0 0 0 0 0 0 0 0 0 ...
 $ DRIFT_THICK_3            : num  1 1 1 1 1 1 1 1 1 1 ...
 $ DRIFT_THICK_4            : num  0 0 0 0 0 0 0 0 0 0 ...
 $ LOC_REL_RE               : num  -0.862 -0.857 -0.857 -0.845 -0.84 ...
 $ LOC_SD_SLOPE             : num  -1.08 -1.08 -1.08 -1.06 -1.06 ...
 $ SITE_NONSITE             : int  0 0 0 0 0 0 0 0 0 0 ...
 $ SSURGO_ESRI_DRAINAGE_RE_2: num  0 0 0 0 0 0 0 0 0 0 ...
 $ SSURGO_ESRI_DRAINAGE_RE_3: num  0 0 0 0 0 0 0 0 0 0 ...
 $ SSURGO_ESRI_DRAINAGE_RE_4: num  0 0 0 0 0 0 0 0 0 0 ...
 $ SSURGO_ESRI_DRAINAGE_RE_5: num  0 1 1 0 1 1 0 0 0 0 ...
 $ SSURGO_ESRI_DRAINAGE_RE_6: num  1 0 0 1 0 0 1 1 1 1 ...
 $ SSURGO_ESRI_DRAINAGE_RE_7: num  0 0 0 0 0 0 0 0 0 0 ...
 $ SSURGO_ESRI_EROSION_RE_2 : num  1 0 0 1 0 0 1 1 1 1 ...
 $ SSURGO_ESRI_EROSION_RE_3 : num  0 0 0 0 0 0 0 0 0 0 ...
 $ SSURGO_ESRI_EROSION_RE_4 : num  0 0 0 0 0 0 0 0 0 0 ...
 $ SSURGO_ESRI_EROSION_RE_5 : num  0 0 0 0 0 0 0 0 0 0 ...
 $ SSURGO_ESRI_LOC_DIV      : num  -0.184 -0.22 -0.168 -0.316 -0.322 ...
 $ SSURGO_ESRI_NATIVEVEG_2  : num  0 0 0 0 0 0 0 0 0 0 ...
 $ SSURGO_ESRI_NATIVEVEG_3  : num  0 0 0 0 0 0 0 0 0 0 ...
 $ SSURGO_ESRI_NATIVEVEG_4  : num  0 0 0 0 0 0 0 0 0 0 ...
 $ SSURGO_PH                : num  0.86 0.632 0.518 0.86 0.518 ...
 $ WATERSHED_INDEX          : int  3 3 3 3 3 3 3 3 3 3 ...

Would converting the data.frame to a matrix using data.matrix(frame, rownames.force = NA) or similar reduce the amount of time it takes rstan / map2stan to complete sampling and fit the model?

I've run across the argument in a number of places that operations performed on matrices are generally faster than those on data.frames. rstan does all of its heavy lifting in C++ though, so for all I know it's already doing a similar conversion anyway as part of its operations. Any insight or recommendations would be appreciated.
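
For a sense of scale, the conversion itself can be timed directly. The sketch below is only an illustration: it builds a synthetic data frame with the same dimensions as dcc.s.dummy (the name `fake` and its random contents are assumptions, not the real data) and times data.matrix() on it.

```r
# Synthetic stand-in for dcc.s.dummy: 85604 rows, 34 numeric columns.
# The values are random; only the dimensions mirror the real data.
set.seed(1)
n_obs <- 85604
n_var <- 34
fake <- as.data.frame(matrix(rnorm(n_obs * n_var), nrow = n_obs, ncol = n_var))

# Time the data.frame -> matrix conversion mentioned in the question.
conversion_time <- system.time(
  fake_matrix <- data.matrix(fake, rownames.force = NA)
)

print(dim(fake_matrix))
print(conversion_time["elapsed"])
```

Whatever the elapsed time turns out to be on a given machine, it is the figure to compare against a sampling run measured in days.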


Solution

  • If the runtime is on the order of days and the compile time is about a minute, you are not going to notice the time spent on anything done beforehand on the R side, including any difference between storing the data as a matrix or as a data.frame.

    In other words, in this situation you should worry much, much more about whether the Stan code generated by rethinking::map2stan is inefficient than about whether the data processing code in rethinking is inefficient. Since rethinking is not optimized for your use case, it is quite likely that rstanarm, brms, or hand-written Stan code (especially code that uses linear algebra rather than the more scalar algebra generated by rethinking::map2stan) will run a lot faster.
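
To illustrate what "linear algebra rather than scalar algebra" means here, the following base-R sketch computes the same multilevel logistic-regression linear predictor two ways: term by term, one coefficient at a time, and as a single matrix-vector product. The data, coefficient values, and variable names are made up for illustration. This is not the code map2stan generates, only an analogy for the two styles; in hand-written Stan the matrix form would correspond to something like X * beta inside one vectorized likelihood statement.

```r
# Made-up data: 1000 observations, 5 predictors, 4 groups (all hypothetical).
set.seed(42)
n_obs <- 1000
n_pred <- 5
n_group <- 4

X <- matrix(rnorm(n_obs * n_pred), nrow = n_obs, ncol = n_pred)
group <- sample.int(n_group, n_obs, replace = TRUE)
beta <- rnorm(n_pred)      # hypothetical slopes
a_group <- rnorm(n_group)  # hypothetical varying intercepts

# Scalar style: add one predictor term at a time.
eta_scalar <- a_group[group]
for (k in seq_len(n_pred)) {
  eta_scalar <- eta_scalar + beta[k] * X[, k]
}

# Linear-algebra style: one matrix-vector product for all predictors.
eta_matrix <- a_group[group] + as.vector(X %*% beta)

# Both styles give the same linear predictor up to floating-point error.
print(all.equal(eta_scalar, eta_matrix))

# A data list in the shape a hand-written Stan program might expect
# (the element names N, K, J, X, group are assumptions, not a fixed API).
stan_data <- list(N = n_obs, K = n_pred, J = n_group, X = X, group = group)
```

With a design matrix like this, the predictors are passed to Stan as one matrix rather than as 30-odd separate vectors. Packages such as rstanarm (for example via stan_glmer with a binomial family) and brms build this kind of matrix from a formula, which is part of why they are suggested as alternatives.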