Tags: r, matlab, glmnet

glmnet batching oversized dataset


I am doing multinomial regression with glmnet in MATLAB and have a dataset that is approximately 6-10 GB, depending on how large I make the test set. I can load it into memory, but glmnetmex seems unable to handle the entire dataset for larger training sizes (such as leave-one-out). I suspect there should be a way to feed the inputs to glmnetmex in batches, but I can't find it in the documentation. Does it exist, or do you have any recommendations for how to proceed otherwise? I'm fine using the R version instead if it has a way of addressing this issue.


Solution

  • Being able to feed batches to a method requires two things:

    • Ability to initialize the learning algorithm with a previous fit
    • Ability to run the learning algorithm for a limited number of iterations

    glmnet has the latter but, unfortunately, not the former. I think you have several options for approaching the problem:

    • Find a better machine. You may consider using one of the cloud services, if your financial resources allow for it.
    • Dig into the glmnet code. Both the MATLAB and the R packages are wrappers around the actual optimizer, which is written in FORTRAN. Both wrappers freshly initialize the model variables before passing them to the FORTRAN solver. You could try modifying the wrapper so that it passes a pre-computed model in as the starting point.
    • You may consider building an ensemble predictor, where you train a separate glmnet model on each batch and use a weighted voting scheme (where each predictor is weighted by its cross-validation performance) to make final predictions.
    • I have an R package that provides a more general regularization framework, but it can also be used to train standard elastic net models, as glmnet does. My package does allow you to initialize training with a pre-computed model, as well as run training for a fixed number of iterations. The downside is that I only have a binomial solver, not a multinomial one, so you would have to wrap it in a one-vs-one or one-vs-rest scheme.
    • Finally, if you are not attached to linear models, there are plenty of other learning methods that allow for easy batching of inputs. Deep learning and neural network frameworks are currently among the more popular ones.