I have a large data set (>1 TB) on which I wish to train Scikit-Learn's GradientBoostingRegressor.
As the data is well beyond my RAM capacity, I am thinking of splitting it into chunks and calling fit() on them one by one.
I understand that setting the 'warm_start' attribute to True keeps the already-fitted trees after fit(). However, it seems that I also need to increase the number of estimators on each successive call to fit().
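Roughly what I have in mind (just a sketch; load_chunks() below is a stand-in that generates synthetic data in place of my real chunk reader):

    import numpy as np
    from sklearn.ensemble import GradientBoostingRegressor

    def load_chunks(n_chunks=5, chunk_size=10_000, n_features=20, seed=0):
        """Stand-in for reading successive chunks of the data set from disk."""
        rng = np.random.RandomState(seed)
        coef = rng.rand(n_features)
        for _ in range(n_chunks):
            X = rng.rand(chunk_size, n_features)
            y = X @ coef + 0.1 * rng.randn(chunk_size)
            yield X, y

    model = GradientBoostingRegressor(n_estimators=10, warm_start=True)
    for i, (X_chunk, y_chunk) in enumerate(load_chunks()):
        model.n_estimators = 10 * (i + 1)  # must keep growing, or fit() adds no new trees
        model.fit(X_chunk, y_chunk)        # the new trees are fit only on this chunk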
Is it possible to fit() all the chunks of data first before increasing the number of estimators by one?
What is the best solution to my problem, i.e. fitting a super-large data set?
You might want to try the partial_fit method of the SGD estimators (SGDRegressor in this case). It's not a GBM, but it works very well, and for data of that size you might get good results with a linear model plus proper interaction terms.
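Something along these lines, as a rough sketch rather than drop-in code (iter_chunks() is just a placeholder that generates synthetic data in place of however you actually stream your chunks):

    import numpy as np
    from sklearn.linear_model import SGDRegressor
    from sklearn.preprocessing import StandardScaler

    def iter_chunks(n_chunks=100, chunk_size=10_000, n_features=20, seed=0):
        """Placeholder for streaming (X, y) chunks from disk."""
        rng = np.random.RandomState(seed)
        coef = rng.rand(n_features)
        for _ in range(n_chunks):
            X = rng.rand(chunk_size, n_features)
            y = X @ coef + 0.1 * rng.randn(chunk_size)
            yield X, y

    scaler = StandardScaler()
    model = SGDRegressor()

    for X_chunk, y_chunk in iter_chunks():
        # SGD is sensitive to feature scale, so update the scaler incrementally too.
        scaler.partial_fit(X_chunk)
        model.partial_fit(scaler.transform(X_chunk), y_chunk)

For the interaction terms, you could fit a PolynomialFeatures(interaction_only=True, include_bias=False) transformer once (it only needs to know the number of features) and transform every chunk before passing it to partial_fit.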