I have a lot of data and I want to parallelize estimator fitting by splitting the data up and fitting multiple estimators in parallel, either across multiple threads or across multiple machines.
Some estimators provide a partial_fit API for out-of-core learning (e.g. PassiveAggressiveClassifier).
Is it possible to have multiple estimators fit partially, and then combine their individual fits into a single estimator?
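Roughly what I have in mind (here `chunks` and `all_classes` are placeholders for my own data splitting, and each loop iteration would in practice run on its own thread or machine):

```python
from sklearn.linear_model import PassiveAggressiveClassifier

# One estimator per data chunk; each of these fits would run in parallel.
estimators = []
for X_chunk, y_chunk in chunks:  # placeholder for my data splits
    clf = PassiveAggressiveClassifier()
    clf.partial_fit(X_chunk, y_chunk, classes=all_classes)
    estimators.append(clf)

# How do I merge `estimators` into a single fitted estimator?
```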
Not using the standard API. You can just average the coef_ and intercept_ attributes, and that will produce a meaningful estimator. Do you want to parallelize over multiple cores on one machine, or over a network? There might be more efficient options for you, most of which will require a little more work.
There are parallel implementations of SGD, but these will probably only pay off for huge data sets. How large is your data (number of samples, number of features, sparsity)?