I'm going to train a naive Bayes classifier on a set of training documents using Apache Spark (or Mahout on Hadoop). I'd like to use this model when I receive new documents to classify. Is it possible to store the model once it is trained and then load it in another Spark job later?
In Mahout's MapReduce-backed NaiveBayes, the model will be saved to the directory specified by the -o
parameter if training is done via the CLI:
mahout trainnb
-i ${PATH_TO_TFIDF_VECTORS}
-o ${PATH_TO_MODEL}/model
-li ${PATH_TO_MODEL}/labelindex
-ow
-c
See: http://mahout.apache.org/users/classification/bayesian.html
And retrieved via:
NaiveBayesModel model = NaiveBayesModel.materialize(new Path("/path/to/model"), getConf());
Alternatively, using Mahout-Samsara's Spark-backed Naive Bayes, a model can be trained from the command line and will similarly be output to the path specified by the -o
parameter:
mahout spark-trainnb
-i ${PATH_TO_TFIDF_VECTORS}
-o ${PATH_TO_MODEL}
-ow
-c
or a model can be trained from within an application via:
val model = SparkNaiveBayes.train(aggregatedObservations, labelIndex, false)
Output to (HD)FS by:
model.dfsWrite("/path/to/model")
and retrieved via:
val retrievedModel = NBModel.dfsRead("/path/to/model")
See: http://mahout.apache.org/users/environment/classify-a-doc-from-the-shell.html