python apache-spark-mllib naivebayes document-classification

Document classification in Spark MLlib

I want to classify documents as sports, entertainment, or politics. I have created a bag of words, which outputs something like:

(1, 'saurashtra') (1, 'saumyajit') (1, 'satyendra')

I want to implement the naive Bayes algorithm for classification using Spark MLlib. My question is: how do I convert this output into something that naive Bayes can use as input for classification, such as an RDD? Or is there a trick to convert the HTML files directly into something that MLlib's naive Bayes can use?

Solution

For text classification, you need to:

Build a word dictionary
Convert each document into a vector using the dictionary
Label the document vectors:

doc_vec1 -> label1

doc_vec2 -> label2

...

This sample is pretty straightforward.
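
As a rough illustration of the three steps (this is not the sample referred to above), here is a minimal sketch in plain Python. The toy corpus, the label ids, and the helper names `build_vocabulary`, `to_sparse_counts` and `train_with_spark` are all made up for this example. The Spark part assumes pyspark is installed and that `sc` is an existing SparkContext; it is only defined here, not called.

```python
from collections import Counter


def build_vocabulary(docs):
    """Step 1: map every distinct word to a column index (sorted for determinism)."""
    words = sorted({word for tokens in docs for word in tokens})
    return {word: index for index, word in enumerate(words)}


def to_sparse_counts(tokens, vocab):
    """Step 2: turn one document into {column index: word count}."""
    counts = Counter(token for token in tokens if token in vocab)
    return {vocab[word]: float(count) for word, count in counts.items()}


# Hypothetical toy corpus: (category name, tokens of one document).
corpus = [
    ("sports", ["saurashtra", "cricket", "match", "cricket"]),
    ("politics", ["election", "minister", "vote"]),
    ("entertainment", ["film", "actor", "film"]),
]

# Naive Bayes in MLlib expects numeric labels, so map the names to floats.
label_ids = {"sports": 0.0, "entertainment": 1.0, "politics": 2.0}

vocab = build_vocabulary(tokens for _, tokens in corpus)

# Step 3: pair each document vector with its label.
rows = [(label_ids[name], to_sparse_counts(tokens, vocab)) for name, tokens in corpus]


def train_with_spark(sc, rows, vocab_size):
    """Wrap the rows in LabeledPoint objects and train MLlib's NaiveBayes.

    Assumes pyspark is available and `sc` is a live SparkContext.
    """
    from pyspark.mllib.classification import NaiveBayes
    from pyspark.mllib.linalg import Vectors
    from pyspark.mllib.regression import LabeledPoint

    points = sc.parallelize(
        [LabeledPoint(label, Vectors.sparse(vocab_size, counts)) for label, counts in rows]
    )
    return NaiveBayes.train(points, 1.0)  # second argument is the smoothing parameter
```

With a SparkContext at hand, something like `model = train_with_spark(sc, rows, len(vocab))` should give a model whose `predict` method takes a vector built the same way from a new document. For a large corpus you would probably build the dictionary and the vectors inside Spark transformations rather than on the driver.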