I'm trying to figure out if it's possible to "retrain" a model when new and unknown data is available for training. My idea goes like this:
First, train on some dataset and produce a model. That model can then be saved for future use (with the write().save() command). Every time my program runs, I will load that model instead of training a new one on the same or similar data (I know I can load a model with the load() command). However, the data I work with is bound to change at some point, significantly enough that my model's predictions will no longer be accurate. That doesn't mean the model is wrong, just that it needs some readjustment, and that's where "retraining" comes to mind. I would like to take my old model, retrain it with the new data, and save it again. Is it possible to do this in Apache Spark? Or would I need to create a new model based solely on the new data?
FYI, I'm talking about a classification model, more specifically about Random Forest or GBT.
You can combine old and new data and train a new model using all available data.
Spark's tree models (Random Forest, GBT) have no option for incremental training. You cannot just start with the old model and add new data.
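As a rough illustration of the first option (retraining on everything), here is a minimal sketch. The helper name retrain_on_all_data and its parameters are made up for this example; it only assumes that both DataFrames have the same columns and that the estimator is an unfitted Spark ML estimator such as RandomForestClassifier or GBTClassifier.

```python
def retrain_on_all_data(estimator, old_df, new_df, model_path=None):
    """Fit a fresh model on old + new data (hypothetical helper).

    estimator:  an unfitted Spark ML estimator, e.g. RandomForestClassifier
    old_df:     DataFrame the previous model was trained on
    new_df:     DataFrame with the newly collected data (same columns)
    model_path: optional location where the fitted model is saved
    """
    # unionByName matches columns by name rather than by position.
    combined = old_df.unionByName(new_df)

    # This trains from scratch; the old model is not reused in any way.
    model = estimator.fit(combined)

    if model_path is not None:
        # overwrite() replaces a model previously saved at the same path.
        model.write().overwrite().save(model_path)
    return model
```

With PySpark you would call it along the lines of retrain_on_all_data(RandomForestClassifier(labelCol="label", featuresCol="features"), old_df, new_df, model_path), where the column names and the path are placeholders for your own.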
You could build a simple ensemble instead: train a new model on the new data only, then make predictions with both the old and the new model and combine them by weighting each model's probabilities. This is not built in, so you'll have to implement it yourself.
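The weighting step itself is simple. The sketch below is plain Python and not a Spark API: blend_probabilities, predict_class and new_weight are names invented for this example. It assumes you have already extracted the per-class probability vectors from both models for the same row, and that both models use the same label indexing.

```python
def blend_probabilities(old_probs, new_probs, new_weight=0.5):
    """Weighted average of two class-probability vectors.

    new_weight is the share given to the model trained on new data;
    the old model gets 1 - new_weight. The value is a tuning choice.
    """
    if len(old_probs) != len(new_probs):
        raise ValueError("both models must predict the same set of classes")
    if not 0.0 <= new_weight <= 1.0:
        raise ValueError("new_weight must be between 0 and 1")
    old_weight = 1.0 - new_weight
    return [old_weight * p_old + new_weight * p_new
            for p_old, p_new in zip(old_probs, new_probs)]


def predict_class(old_probs, new_probs, new_weight=0.5):
    """Return the index of the class with the highest blended probability."""
    blended = blend_probabilities(old_probs, new_probs, new_weight)
    return max(range(len(blended)), key=blended.__getitem__)
```

A higher new_weight makes the ensemble follow the recent data more closely. In Spark you would apply something like this to the probability columns produced by each model's transform(), for example inside a UDF after joining the two outputs on a row identifier.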