
Estimating an OLS model in R with ~1 million observations and thousands of variables


I am trying to estimate a big OLS regression with ~1 million observations and ~50,000 variables using biglm.

I am planning to run the estimation in chunks of approximately 100 observations each. I tested this strategy with a small sample and it worked fine.
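
This strategy is essentially the standard biglm chunking idiom; roughly like the following (toy data and a short formula here, purely for illustration):

library(biglm)
dat <- data.frame(y = rnorm(1000), x1 = rnorm(1000), x2 = rnorm(1000))
# split the row indices into chunks of 100
chunks <- split(seq_len(nrow(dat)), ceiling(seq_len(nrow(dat)) / 100))
# initial fit on the first chunk, then feed the remaining chunks to update()
fit <- biglm(y ~ x1 + x2, data = dat[chunks[[1]], ])
for (idx in chunks[-1]) fit <- update(fit, dat[idx, ])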

However, with the real data I am getting an "Error: protect(): protection stack overflow" when trying to define the formula for the biglm function.

I've already tried:

  • starting R with --max-ppsize=50000

  • setting options(expressions = 50000)

but the error persists.
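
Concretely, those two attempts look like this (--max-ppsize has to be supplied when R is launched; it cannot be changed from inside a running session):

# when starting R from the command line:
#   R --max-ppsize=50000
# then, inside the session:
options(expressions = 50000)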

I am working on Windows and using RStudio.

# create the sample data frame (in my real case, I simply select 100 rows
# from the original data, which contains ~1,000,000 rows)
DF <- data.frame(matrix(nrow = 100, ncol = 50000))
DF[,] <- rnorm(100 * 50000)
colnames(DF) <- c("y", paste0("x", seq_len(49999)))

# get names of covariates
my_xvars <- colnames(DF)[-1]

# define the formula to be used in biglm
# HERE IS WHERE I GET THE ERROR:
my_f <- as.formula(paste("y~", paste(my_xvars, collapse = " + ")))

EDIT 1: The ultimate goal of my exercise is to estimate the average effect of all 50,000 variables, so simplifying the model by selecting fewer variables is not the solution I am looking for.


Solution

  • The first bottleneck (I can't guarantee there won't be others) is in the construction of the formula. R can't construct a formula that long from text (the details are too ugly to explore right now). Below I show a hacked version of the biglm code that can take the model matrix X and the response variable y directly, rather than using a formula to build them. However, the next bottleneck is that the internal function biglm:::bigqr.init(), which gets called inside biglm, tries to allocate a numeric vector of size choose(nc, 2) = nc*(nc-1)/2 (where nc is the number of columns). When I try with 50000 columns I get

    Error: cannot allocate vector of size 9.3 Gb

    (2.3 Gb is required when nc is 25000). The code below runs on my laptop when nc <- 10000.
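
    For concreteness: the allocation is choose(nc, 2) doubles at 8 bytes each, so the sizes in those error messages can be reproduced directly:

    nc <- 50000
    choose(nc, 2) * 8 / 2^30     # ~9.3 Gb when nc = 50000
    choose(25000, 2) * 8 / 2^30  # ~2.3 Gb when nc = 25000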

    I have a few caveats about this approach:

    • you won't be able to handle a problem with 50000 columns unless you have at least 10 Gb of memory, because of the issue described above.
    • biglm:::update.biglm will have to be modified analogously (this shouldn't be too hard; see the sketch after this list)
    • I have no idea if the p>>n issue (which applies at the level of fitting the initial chunk) will bite you. When running my example below (with 10 rows, 10000 columns), all but 10 of the parameters are NA. I don't know if these NA values will contaminate the results so that successive updating fails. If so, I don't know if there's a way to work around the problem, or if it's fundamental (so that you would need nr>nc for at least the initial fit). (It would be straightforward to do some small experiments to see if there is a problem, but I've already spent too long on this ...)
    • don't forget that with this approach you have to explicitly add an intercept column to the model matrix (e.g. X <- cbind(1, X)) if you want one.
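
    Here is roughly what that modification might look like (my_update_biglm is a hypothetical name; this sketch just mirrors the formula-free part of biglm:::update.biglm and skips the sandwich bookkeeping):

    my_update_biglm <- function(object, y, X, w = NULL, off = 0) {
        ## push a new chunk through the stored incremental QR decomposition;
        ## X must have the same columns, in the same order, as the initial fit
        object$qr <- biglm:::update.bigqr(object$qr, X, y - off, w)
        object$n <- object$n + NROW(X)
        object$df.resid <- object$n - length(object$qr$D)
        object
    }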

    Example (first save the code at the bottom as my_biglm.R):

    nr <- 10
    nc <- 10000
    DF <- data.frame(matrix(rnorm(nr*nc),nrow=nr))
    
    respvars <- paste0("x", seq_len(nc - 1))
    names(DF) <- c("y", respvars)
    
    # illustrate formula problem: fails somewhere in 15000 < nc < 20000
    try(reformulate(respvars,response="y"))
    
    source("my_biglm.R")
    rr <- my_biglm(y=DF[,1],X=as.matrix(DF[,-1]))
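
    A second chunk could then be pushed through the hypothetical my_update_biglm sketched above, along these lines (the p >> n caveat still applies):

    chunk1 <- DF[1:5, ]
    chunk2 <- DF[6:10, ]
    rr <- my_biglm(y = chunk1[, 1], X = as.matrix(chunk1[, -1]))
    rr <- my_update_biglm(rr, y = chunk2[, 1], X = as.matrix(chunk2[, -1]))

    The code to save as my_biglm.R: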
    

    ## hacked version of biglm::biglm() that can accept the response y and
    ## model matrix X directly, bypassing the formula interface
    my_biglm <- function (formula, data, weights = NULL, sandwich = FALSE,
                          y = NULL, X = NULL, off = 0) {
        if (!is.null(weights)) {
            if (!inherits(weights, "formula")) 
                stop("`weights' must be a formula")
            w <- model.frame(weights, data)[[1]]
        } else w <- NULL
        if (is.null(X)) {
            tt <- terms(formula)
            mf <- model.frame(tt, data)
            if (is.null(off <- model.offset(mf))) 
                off <- 0
            mm <- model.matrix(tt, mf)
            y <- model.response(mf) - off
        } else {
            ## model matrix specified directly
            if (is.null(y)) stop("both y and X must be specified")
            mm <- X
            tt <- NULL
        }
        qr <- biglm:::bigqr.init(NCOL(mm))
        qr <- biglm:::update.bigqr(qr, mm, y, w)
        rval <- list(call = sys.call(), qr = qr, assign = attr(mm, 
            "assign"), terms = tt, n = NROW(mm), names = colnames(mm), 
            weights = weights)
        if (sandwich) {
            p <- ncol(mm)
            n <- nrow(mm)
            xyqr <- biglm:::bigqr.init(p * (p + 1))  # bigqr.init is not exported, so it needs :::
            xx <- matrix(nrow = n, ncol = p * (p + 1))
            xx[, 1:p] <- mm * y
            for (i in 1:p) xx[, p * i + (1:p)] <- mm * mm[, i]
            xyqr <- biglm:::update.bigqr(xyqr, xx, rep(0, n), w * w)  # explicit ::: call, as above
            rval$sandwich <- list(xy = xyqr)
        }
        rval$df.resid <- rval$n - length(qr$D)
        class(rval) <- "biglm"
        rval
    }