apache-spark, mahout, apache-spark-mllib

Can a trained classification model be stored in Apache Spark?


I'm going to train a naive Bayes classifier on a set of training documents using Apache Spark (or Mahout on Hadoop), and I'd like to use this model when I receive new documents to classify. Is there any way to store the model once it is trained and then load it in another Spark job later?


Solution

  • In Mahout's MapReduce-backed NaiveBayes, the model is saved to the directory specified by the -o parameter when training via the CLI:

    mahout trainnb \
      -i ${PATH_TO_TFIDF_VECTORS} \
      -o ${PATH_TO_MODEL}/model \
      -li ${PATH_TO_MODEL}/labelindex \
      -ow \
      -c
    

    See: http://mahout.apache.org/users/classification/bayesian.html

    And retrieved via:

    NaiveBayesModel model = NaiveBayesModel.materialize(new Path("/path/to/model"), getConf());
    

    Alternatively, using Mahout-Samsara's Spark-backed Naive Bayes, a model can be trained from the command line and will similarly be output to the path specified by the -o parameter:

    mahout spark-trainnb \
      -i ${PATH_TO_TFIDF_VECTORS} \
      -o ${PATH_TO_MODEL} \
      -ow \
      -c
    

    or a model can be trained from within an application via:

    val model = SparkNaiveBayes.train(aggregatedObservations, labelIndex, false)
    

    and written out to (HD)FS via:

    model.dfsWrite("/path/to/model")
    

    and retrieved via:

    val retrievedModel = NBModel.dfsRead("/path/to/model")
    

    See: http://mahout.apache.org/users/environment/classify-a-doc-from-the-shell.html
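    Putting the Samsara snippets together, one Spark job can train and persist the model, and a later job can reload it to classify new TF-IDF vectors. This is only a sketch: it assumes a Mahout `DistributedContext` is already implicit in scope (as it is in the Mahout Spark shell), that `aggregatedObservations` and `labelIndex` were produced by your preprocessing pipeline, and that package paths match your Mahout version (0.10.x here):

    ```scala
    // Assumed package locations; verify against your Mahout version.
    import org.apache.mahout.classifier.naivebayes.{NBModel, SparkNaiveBayes}

    // Job 1: train on the aggregated (label, TF-IDF vector) observations
    // and persist the model to HDFS. The third argument selects standard
    // (false) vs. complementary (true) Naive Bayes.
    val model = SparkNaiveBayes.train(aggregatedObservations, labelIndex, false)
    model.dfsWrite("/path/to/model")

    // Job 2 (possibly a separate application): reload the persisted model
    // and use it to score newly arriving document vectors.
    val retrievedModel = NBModel.dfsRead("/path/to/model")
    ```

    The key point is that the model lives on (HD)FS between the two jobs, so the second job has no dependency on the first beyond the agreed-upon path.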