python apache-spark-mllib naivebayes document-classification

Document classification in Spark MLlib


I want to classify documents according to whether they belong to sports, entertainment, or politics. I have created a bag of words, which outputs something like:

(1, 'saurashtra') (1, 'saumyajit') (1, 'satyendra')

I want to implement the naive Bayes algorithm for classification using Spark MLlib. My question is: how do I convert this output into something that naive Bayes can use as input for classification, such as an RDD? Or is there a trick to convert the HTML files directly into something that MLlib's naive Bayes can use?


Solution

  • For text classification, you need:

    • A word dictionary
    • A vector for each document, built using the dictionary
    • A label for each document vector:

      doc_vec1 -> label1

      doc_vec2 -> label2

      ...

    This sample is pretty straightforward.