machine-learning, mahout, sentiment-analysis

Mahout for sentiment analysis


Using Mahout, I am able to classify the sentiment of my data, but I am stuck at the confusion matrix.

I am using the Mahout 0.7 naive Bayes algorithm to classify the sentiment of tweets. I use trainnb and testnb to train the classifier and to classify each tweet as 'positive', 'negative' or 'neutral'.

Sample positive training set:

      'positive','i love my i phone'
      'positive','it's pleasure to have i phone'

Similarly, I have prepared training samples for negative and neutral; it is a huge data set.

The sample test tweets I am providing do not include sentiment labels:

      'it is nice model'
      'simply fantastic'

I am able to run the Mahout classification algorithm, and it outputs the classified instances as a confusion matrix.

As the next step, I need to find out which tweets show positive sentiment and which show negative. The expected output of the classification is each text tagged with its sentiment:

      'negative','very bad btr life time'
      'positive','i phone has excellent design features'

Which algorithm in Mahout do I need to use to get output in the above format? Or is a custom implementation required?

To display the data this way, kindly suggest algorithms provided by Apache Mahout that would be suitable for sentiment analysis of my Twitter data.


Solution

  • In general, to classify some text you need to run Naive Bayes once per class, each with its own prior (positive and negative in your case), and then just choose the class that results in the greater value.

    This excerpt from the Mahout book has some examples. See Listing 2:

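    // Point the classifier at the directory that holds the trained model.
    Parameters p = new Parameters();
    p.set("basePath", modelDir.getCanonicalPath());
    // Use an in-memory datastore with the plain Bayes algorithm.
    Datastore ds = new InMemoryBayesDatastore(p);
    Algorithm a = new BayesAlgorithm();
    ClassifierContext ctx = new ClassifierContext(a, ds);
    ctx.initialize();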
    
    ....
    
    ClassifierResult result = ctx.classifyDocument(tokens, defaultCategory);
    

    Here, result should hold either the "positive" or the "negative" label (or "neutral", if the model was trained with that class).
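
    The idea behind this (score the text under each class, keep the highest-scoring class, then print the label next to the text) can be sketched without Mahout at all. The following is a minimal, self-contained toy in plain Java, not Mahout's implementation: the class TweetSentimentSketch, its methods, and the training lines marked as made up are assumptions for illustration only; the remaining training lines are taken from the question.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Toy multinomial Naive Bayes; illustrative only, not Mahout's implementation. */
class TweetSentimentSketch {
    private final Map<String, Integer> docsPerLabel = new HashMap<>();
    private final Map<String, Map<String, Integer>> wordCounts = new HashMap<>();
    private final Map<String, Integer> wordsPerLabel = new HashMap<>();
    private final Set<String> vocabulary = new HashSet<>();
    private int totalDocs = 0;

    /** Lower-cases the text and splits it on whitespace. */
    static String[] tokenize(String text) {
        return text.toLowerCase().trim().split("\\s+");
    }

    /** Adds one labeled example to the per-class counts. */
    void train(String label, String text) {
        totalDocs++;
        docsPerLabel.merge(label, 1, Integer::sum);
        Map<String, Integer> counts = wordCounts.computeIfAbsent(label, k -> new HashMap<>());
        for (String token : tokenize(text)) {
            counts.merge(token, 1, Integer::sum);
            wordsPerLabel.merge(label, 1, Integer::sum);
            vocabulary.add(token);
        }
    }

    /** log P(label) + sum of log P(token | label), with add-one smoothing. Label must have been trained. */
    double score(String label, String text) {
        double logProb = Math.log((double) docsPerLabel.get(label) / totalDocs);
        Map<String, Integer> counts = wordCounts.get(label);
        int denominator = wordsPerLabel.getOrDefault(label, 0) + vocabulary.size();
        for (String token : tokenize(text)) {
            logProb += Math.log((counts.getOrDefault(token, 0) + 1.0) / denominator);
        }
        return logProb;
    }

    /** Scores the text under every trained class and returns the best-scoring label. */
    String classify(String text) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String label : docsPerLabel.keySet()) {
            double s = score(label, text);
            if (s > bestScore) {
                bestScore = s;
                best = label;
            }
        }
        return best;
    }

    /** Formats a tweet in the 'label','text' layout asked for in the question. */
    static String tag(TweetSentimentSketch model, String tweet) {
        return "'" + model.classify(tweet) + "','" + tweet + "'";
    }

    /** Builds a tiny model; a real training set would be far larger. */
    static TweetSentimentSketch demoModel() {
        TweetSentimentSketch model = new TweetSentimentSketch();
        model.train("positive", "i love my i phone");
        model.train("positive", "it's pleasure to have i phone");
        model.train("positive", "nice and fantastic design");   // made up for this sketch
        model.train("negative", "very bad btr life time");
        model.train("negative", "i hate this phone");           // made up for this sketch
        model.train("negative", "worst design ever");           // made up for this sketch
        return model;
    }

    public static void main(String[] args) {
        TweetSentimentSketch model = demoModel();
        String[] tweets = {"it is nice model", "simply fantastic"};
        for (String tweet : tweets) {
            System.out.println(tag(model, tweet));
        }
    }
}
```

    With this toy data, tag(model, "simply fantastic") yields 'positive','simply fantastic'. With a trained Mahout model, the classify step would presumably be replaced by the ctx.classifyDocument call shown above, while the loop over tweets and the output formatting stay the same.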