I am trying to train a neural net (backprop + gradient descent) in Python on features I am constructing on top of the Google Books 2-grams (English). It will end up being around a billion rows of data with 20 features per row. This will easily exceed my memory, so in-memory arrays such as numpy are not an option, since they require loading the complete training set.
I looked into memory mapping in numpy, which could solve the problem for the input layer (which is read-only), but I will also need to store and manipulate the internal layers of the net. That requires extensive reads and writes, and given the size of the data, performance is crucial: it could save me days of processing.
Is there a way to train the model without having to load the complete training set in memory for each iteration of cost (loss) minimization?
What you are probably looking for is minibatching. Most methods of training neural nets are gradient based, and since your loss function is a function of the training set, so is the gradient. As you said, the full set may exceed your memory. Luckily, for additive loss functions (and most losses you will ever use are additive, i.e. a sum of per-example terms), one can show that under mild conditions (such as a suitably decreasing learning rate) you can substitute stochastic (or minibatch) gradient descent for full gradient descent and still converge to a local minimum. Nowadays it is very common practice to use batches of 32, 64 or 128 rows, which fit in memory easily. The activations of the internal layers also only need to exist for the current batch, so they never have to be stored for the whole dataset. Such networks can actually converge faster than ones trained with the full gradient, because you make N / 128 updates per pass over the dataset instead of just one. Even if each update is rather rough, in combination they work pretty well.
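To make this concrete, here is a minimal sketch of minibatch training over an array that can be a numpy memmap, so only one batch is ever read into RAM. Everything in it is an assumption for illustration, not your actual setup: the function names (iter_minibatches, train_minibatch), the one-hidden-layer tanh network, the squared-error loss, and the hyperparameters are all made up. In practice X could be something like np.memmap("features.dat", dtype=np.float32, mode="r", shape=(n_rows, 20)), where the file name is hypothetical.

```python
import numpy as np


def iter_minibatches(X, y, batch_size, rng):
    """Yield (X_batch, y_batch) one batch at a time.

    X and y may be np.memmap arrays: slicing reads only the rows of the
    current batch from disk, and np.asarray(..., dtype=...) copies them to RAM.
    """
    n = X.shape[0]
    starts = np.arange(0, n, batch_size)
    rng.shuffle(starts)  # shuffle batch order; rows in a batch stay contiguous on disk
    for s in starts:
        e = min(s + batch_size, n)
        yield (np.asarray(X[s:e], dtype=np.float64),
               np.asarray(y[s:e], dtype=np.float64))


def train_minibatch(X, y, n_hidden=8, batch_size=128, epochs=5, lr=0.05, seed=0):
    """Train a one-hidden-layer regression net with minibatch gradient descent.

    Returns the parameters and the mean squared error observed in each epoch.
    """
    rng = np.random.default_rng(seed)
    n_features = X.shape[1]
    W1 = rng.normal(0.0, 0.1, (n_features, n_hidden))
    b1 = np.zeros(n_hidden)
    W2 = rng.normal(0.0, 0.1, (n_hidden, 1))
    b2 = np.zeros(1)

    history = []
    for _ in range(epochs):
        total, count = 0.0, 0
        for xb, yb in iter_minibatches(X, y, batch_size, rng):
            # Forward pass: hidden activations exist only for this batch.
            h = np.tanh(xb @ W1 + b1)
            pred = (h @ W2 + b2).ravel()
            err = pred - yb
            total += float(np.sum(err ** 2))
            count += len(yb)

            # Backward pass for the mean squared error of this batch.
            g_pred = (2.0 / len(yb)) * err[:, None]
            gW2 = h.T @ g_pred
            gb2 = g_pred.sum(axis=0)
            g_h = (g_pred @ W2.T) * (1.0 - h ** 2)  # derivative of tanh
            gW1 = xb.T @ g_h
            gb1 = g_h.sum(axis=0)

            # One gradient step per batch, not per full pass over the data.
            W1 -= lr * gW1
            b1 -= lr * gb1
            W2 -= lr * gW2
            b2 -= lr * gb2
        history.append(total / count)
    return (W1, b1, W2, b2), history
```

Shuffling only the order of contiguous batches keeps disk reads sequential, which matters at a billion rows. The trade-off is that rows inside a batch are never mixed, so if the file is sorted in some meaningful way it is probably worth shuffling it once on disk beforehand.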