scala apache-spark machine-learning apache-spark-mllib

How to create an Estimator that can train on new samples after it has already been fitted to an initial dataset?

I'm trying to create my own Estimator, following the DeveloperApiExample.scala example I found in the Spark source code.

But in this example, every time I call the fit() method on the Estimator, it returns a new Model.

I want to be able to fit again, training only on samples that the model has not been trained on yet.

I thought about adding a new method to the Model class to do this, but I'm not sure whether that makes sense. It may help to know that my model doesn't need to process the whole dataset again to train on a new sample, and that we don't want to change the model structure.

Solution

The base class for a Spark ML Estimator is defined here. As you can see, its fit method is a plain call that trains the model on the input data.

Take a look at something like the LogisticRegression class, specifically the trainOnRows function, whose input is an RDD and, optionally, an initial coefficient matrix (the output of a previously trained model). Following that pattern lets you train a model iteratively on different datasets.

For what you want to achieve, keep in mind that your algorithm of choice must support iterative updates. Examples include GLMs, neural networks, and tree ensembles.
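To make the warm-start idea concrete, here is a minimal sketch in plain Scala, with no Spark dependency. It is not Spark's implementation: Sample, OnlineLogisticModel, trainFrom, update, and Demo are hypothetical names chosen for illustration, and the optimizer is simple stochastic gradient descent on the logistic loss. The point is only the shape of the API: training optionally starts from the coefficients of an existing model, so new samples can be learned without revisiting the old ones.

```scala
// Hypothetical names throughout; none of these are part of the Spark API.
final case class Sample(features: Vector[Double], label: Double)

final case class OnlineLogisticModel(weights: Vector[Double], bias: Double) {

  // Predicted probability of the positive class.
  def predictProbability(features: Vector[Double]): Double = {
    val margin = bias + weights.zip(features).map { case (w, x) => w * x }.sum
    1.0 / (1.0 + math.exp(-margin))
  }

  // Continue training from the current coefficients, using only the new batch.
  // Returns a new model; the existing one is left unchanged.
  def update(batch: Seq[Sample], stepSize: Double = 0.1, epochs: Int = 50): OnlineLogisticModel =
    OnlineLogisticModel.trainFrom(Some(this), batch, stepSize, epochs)
}

object OnlineLogisticModel {

  // Plain SGD on the logistic loss. Starts from `initial` when provided
  // (warm start), otherwise from all-zero coefficients.
  def trainFrom(
      initial: Option[OnlineLogisticModel],
      batch: Seq[Sample],
      stepSize: Double,
      epochs: Int): OnlineLogisticModel = {
    require(batch.nonEmpty, "batch must not be empty")
    val dim = batch.head.features.length
    val start = initial.getOrElse(OnlineLogisticModel(Vector.fill(dim)(0.0), 0.0))
    require(start.weights.length == dim, "feature dimension must match the existing model")

    (1 to epochs).foldLeft(start) { (epochModel, _) =>
      batch.foldLeft(epochModel) { (m, s) =>
        // Derivative of the log-loss with respect to the margin.
        val error = m.predictProbability(s.features) - s.label
        val newWeights = m.weights.zip(s.features).map { case (w, x) => w - stepSize * error * x }
        OnlineLogisticModel(newWeights, m.bias - stepSize * error)
      }
    }
  }
}

object Demo {
  val firstBatch = Seq(Sample(Vector(0.0), 0.0), Sample(Vector(1.0), 1.0))
  val newBatch   = Seq(Sample(Vector(-1.0), 0.0), Sample(Vector(2.0), 1.0))

  // Initial fit from scratch, then an incremental update on unseen samples only.
  val initialModel = OnlineLogisticModel.trainFrom(None, firstBatch, stepSize = 0.1, epochs = 50)
  val updatedModel = initialModel.update(newBatch)
}
```

In a custom Spark Estimator, the same idea would probably take the form of an optional initial-model parameter that fit() reads before training, or an update-style method on the Model as the question suggests. Either way, the model structure (here, the number of coefficients) has to stay the same between updates.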