I'm trying to implement a baseline prediction model of movie ratings (akin to the various baseline models from the Netflix Prize), with parameters learned via stochastic gradient descent. However, because both explanatory variables are categorical (users and movies), the design matrix is very large and cannot fit into RAM.
I thought that the sgd package would automagically find its way around this issue (since it's designed for large amounts of data), but that does not seem to be the case.
Does anyone know a way around this? Maybe a way to build the design matrix as a sparse matrix.
Cheers,
You can try Matrix::sparseMatrix to create a triplet (i, j, x) representation that describes the matrix in a much more memory-efficient way.
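As a minimal sketch (assuming a hypothetical `ratings` data frame with `user`, `movie`, and `rating` columns), here are two ways to build the sparse design matrix: via `Matrix::sparse.model.matrix`, or directly from triplets with `sparseMatrix`:

```r
library(Matrix)

# Hypothetical example data: one row per observed (user, movie, rating) triple
ratings <- data.frame(
  user   = factor(c("u1", "u2", "u2", "u3")),
  movie  = factor(c("m1", "m1", "m2", "m3")),
  rating = c(4, 3, 5, 2)
)

# Option 1: dummy-code the factors directly into a sparse design matrix
X <- sparse.model.matrix(~ user + movie, data = ratings)

# Option 2: build a full user/movie indicator matrix by hand from (i, j, x) triplets
n  <- nrow(ratings)
nu <- nlevels(ratings$user)
nm <- nlevels(ratings$movie)
X2 <- sparseMatrix(
  i = rep(seq_len(n), 2),                                          # each rating contributes two 1s
  j = c(as.integer(ratings$user), nu + as.integer(ratings$movie)), # user column, then movie column
  x = 1,
  dims = c(n, nu + nm)
)
```

Either form stays sparse in memory, so the full dense design matrix is never materialised.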
You can also try moving the problem to Amazon EC2 and using an instance with more RAM, or configuring a cluster to run a MapReduce job.
Check out the xgboost package (https://github.com/dmlc/xgboost) and its documentation to understand how to deal with memory problems.
Here is also a more practical tutorial: https://cran.r-project.org/web/packages/xgboost/vignettes/discoverYourData.html
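A hedged sketch of how this could look, reusing the hypothetical `ratings` data frame from the sketch above (the model parameters are illustrative only, not recommendations): xgboost can consume a sparse `dgCMatrix` directly, so the dense design matrix never has to be built.

```r
library(Matrix)
library(xgboost)

# `ratings` is the hypothetical data frame from the previous sketch.
# xgboost accepts a sparse dgCMatrix, so memory usage stays low.
X <- sparse.model.matrix(~ user + movie, data = ratings)
dtrain <- xgb.DMatrix(data = X, label = ratings$rating)

# Illustrative parameters only; tune them for a real problem.
fit <- xgb.train(
  params  = list(objective = "reg:squarederror", eta = 0.1, max_depth = 4),
  data    = dtrain,
  nrounds = 50
)

pred <- predict(fit, dtrain)
```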