I am doing multinomial regression with glmnet in MATLAB and have a dataset that is approximately 6-10 GB, depending on how large I make the test set. I am able to load it into memory, but glmnetmex seems unable to handle the entire dataset for larger training sizes (such as leave-one-out). I suspect there should be a way to batch the inputs to glmnetmex, but I can't find it in the documentation. Does it exist, or do you have any recommendations for how to proceed otherwise? I'm fine using the R version instead if it has a way of addressing this issue.
Being able to feed batches to a method requires two things:

1. being able to initialize training with a pre-computed model, and
2. being able to run training for a limited number of iterations on each batch.

`glmnet` has the latter but, unfortunately, not the former. I think you have several options for approaching the problem:
1. Hack the `glmnet` code. Both the MATLAB and the R packages are wrappers for the actual optimizer, which is written in FORTRAN. Both wrappers freshly initialize the model variables before passing them to the FORTRAN solver; you can try modifying them to start from a pre-computed model instead.
2. Train a `glmnet` model on each batch and use a weighted voting scheme (where each predictor is weighted by its cross-validation performance) to make the final predictions (see the first sketch below).
3. Use my package instead of `glmnet`. It does allow you to initialize training with a pre-computed model, as well as run training for a fixed number of iterations. The downside is that I only have a binomial solver, not a multinomial one, so you would have to hack it with a one-vs-one or one-vs-rest scheme (see the second sketch below).
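To make option 2 concrete, here is a minimal sketch in R (you said the R version is fine), assuming the design matrix `x` and label vector `y` are already in memory; the names `x`, `y`, `newx`, and the choice of `n_batches` are placeholders you would adapt. Each batch model is weighted by its cross-validated classification accuracy:

```r
library(glmnet)

# Assign every row to one of n_batches roughly equal random batches.
n_batches <- 10
batch_id  <- sample(rep(seq_len(n_batches), length.out = nrow(x)))

fits    <- vector("list", n_batches)
weights <- numeric(n_batches)

for (b in seq_len(n_batches)) {
  idx       <- which(batch_id == b)
  cvfit     <- cv.glmnet(x[idx, ], y[idx],
                         family = "multinomial", type.measure = "class")
  fits[[b]] <- cvfit
  # Weight each batch model by its cross-validated accuracy
  # (cvm holds the misclassification error at each lambda).
  weights[b] <- 1 - min(cvfit$cvm)
}
weights <- weights / sum(weights)

# Weighted vote: average the per-class probabilities across batch models
# and pick the class with the largest weighted probability.
predict_weighted <- function(fits, weights, newx) {
  probs <- Map(function(f, w) {
    w * predict(f, newx = newx, s = "lambda.min", type = "response")[, , 1]
  }, fits, weights)
  avg <- Reduce(`+`, probs)
  colnames(avg)[max.col(avg)]
}

yhat <- predict_weighted(fits, weights, newx)
```

Averaging probabilities rather than hard votes keeps the weighting smooth; whether this beats a single model fit on all the data is something you would have to check on a held-out set.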
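For option 3, the one-vs-rest hack itself is independent of which binomial solver you use, so here is a sketch of just that wrapping, with `glmnet`'s binomial family standing in for the actual binomial fitter (it does not show the batching); `x`, `y`, and `newx` are again assumed to exist:

```r
library(glmnet)

# One-vs-rest: fit one binomial model per class (class k vs. everything else).
classes <- sort(unique(y))

ovr_fits <- lapply(classes, function(k) {
  cv.glmnet(x, as.numeric(y == k), family = "binomial")
})

# Score new points with every one-vs-rest model and pick the class whose
# model assigns the highest predicted probability.
ovr_predict <- function(fits, classes, newx) {
  scores <- sapply(fits, function(f) {
    as.numeric(predict(f, newx = newx, s = "lambda.min", type = "response"))
  })
  classes[max.col(scores)]
}

yhat <- ovr_predict(ovr_fits, classes, newx)
```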