apache-spark, mahout, apache-spark-mllib

Can a trained classification model be stored in Apache Spark?


I'm going to train a naive Bayes classifier on a set of training documents using Apache Spark (or Mahout on Hadoop), and I'd like to use this model when I receive new documents to classify. Is there any way to store the model once it is trained and then load it in another Spark job later?


Solution

  • In Mahout's MapReduce-backed NaiveBayes, the model is saved to the directory specified by the -o parameter when training via the CLI:

    mahout trainnb \
      -i ${PATH_TO_TFIDF_VECTORS} \
      -o ${PATH_TO_MODEL}/model \
      -li ${PATH_TO_MODEL}/labelindex \
      -ow \
      -c
    

    See: http://mahout.apache.org/users/classification/bayesian.html

    And retrieved via:

    NaiveBayesModel model = NaiveBayesModel.materialize(new Path("/path/to/model"), getConf());
    

    Alternatively, using Mahout-Samsara's Spark-backed Naive Bayes, a model can be trained from the command line and will similarly be output to the path specified by the -o parameter:

    mahout spark-trainnb \
      -i ${PATH_TO_TFIDF_VECTORS} \
      -o ${PATH_TO_MODEL} \
      -ow \
      -c
    

    or a model can be trained from within an application via:

    val model = SparkNaiveBayes.train(aggregatedObservations, labelIndex, false)
    

    and written out to (HD)FS via:

    model.dfsWrite("/path/to/model")
    

    and retrieved via:

    val retrievedModel = NBModel.dfsRead("/path/to/model")
    

    See: http://mahout.apache.org/users/environment/classify-a-doc-from-the-shell.html
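    Putting the Samsara snippets together, one Spark job can train and persist the model, and a later job can reload it to classify new TF-IDF vectors. This is only a sketch: it assumes a Mahout `DistributedContext` is already implicit in scope (as it is in the Mahout Spark shell), that `aggregatedObservations` and `labelIndex` were produced by your preprocessing pipeline, and that package paths match your Mahout version (0.10.x here):

    ```scala
    // Assumed package locations; verify against your Mahout version.
    import org.apache.mahout.classifier.naivebayes.{NBModel, SparkNaiveBayes}

    // Job 1: train on the aggregated (label, TF-IDF vector) observations
    // and persist the model to HDFS. The third argument selects standard
    // (false) vs. complementary (true) Naive Bayes.
    val model = SparkNaiveBayes.train(aggregatedObservations, labelIndex, false)
    model.dfsWrite("/path/to/model")

    // Job 2 (possibly a separate application): reload the persisted model
    // and use it to score newly arriving document vectors.
    val retrievedModel = NBModel.dfsRead("/path/to/model")
    ```

    The key point is that the model lives on (HD)FS between the two jobs, so the second job has no dependency on the first beyond the agreed-upon path.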