Tags: java, machine-learning, weka, text-mining, sentiment-analysis

Java - Implementing Machine Learning methods for text mining


I have some texts and I would like to mine them by applying machine learning methods in Java using the Weka libraries. For that purpose I've already written some code, but since the whole thing is too long I just want to show some key methods and get an idea of how to train and test my dataset, interpret the results, etc.

FYI, I am processing tweets with Twitter4J.

First, I fetched the tweets and saved them in a text file (in ARFF format, of course). Then I manually labeled them with their sentiments (positive, neutral, negative). Based on the selected classifier, I created test sets from my training set via cross-validation. Finally, I classified them and printed the summary and confusion matrix.
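
For reference, the labeled ARFF file can be loaded with Weka's DataSource. The snippet below is only a minimal sketch; the file name and the attribute layout (one string attribute for the tweet text plus a nominal class attribute) are placeholders, not necessarily the exact file:

    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class LoadTweets {
        public static void main(String[] args) throws Exception {
            // "tweets.arff" is a placeholder: one string attribute (the tweet text)
            // plus a nominal class attribute {positive, neutral, negative}
            DataSource source = new DataSource("tweets.arff");
            Instances data = source.getDataSet();

            // the sentiment label is assumed to be the last attribute
            if (data.classIndex() == -1) {
                data.setClassIndex(data.numAttributes() - 1);
            }
            System.out.println("Loaded " + data.numInstances() + " labeled tweets");
        }
    }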

Here is the code for one of my classifiers, Naive Bayes:

public static void ApplyNaiveBayes(Instances data) throws Exception {

    System.out.println("Applying Naive Bayes \n");
    data.setClassIndex(data.numAttributes() - 1); 
    StringToWordVector swv = new StringToWordVector();
    swv.setInputFormat(data);
    Instances dataFiltered = Filter.useFilter(data, swv);
    //System.out.println("Filtered data " +dataFiltered.toString());

    System.out.println("\n\nFiltered data:\n\n" + dataFiltered);

    Instances[][] split = crossValidationSplit(dataFiltered, 10);
    Instances[] trainingSets = split[0];
    Instances[] testingSets = split[1];


    NaiveBayes classifier = new NaiveBayes(); 

    FastVector predictions = new FastVector();


    classifier.buildClassifier(dataFiltered);
    System.out.println("\n\nClassifier model:\n\n" + classifier);     

    // Test the model
    for (int i = 0; i < trainingSets.length; i++) {
        classifier.buildClassifier(trainingSets[i]);
        // Test the model         
        Evaluation eTest = new Evaluation(trainingSets[i]);
        eTest.evaluateModel(classifier, testingSets[i]);

        // Print the result to the Weka explorer:
        String strSummary = eTest.toSummaryString();
        System.out.println(strSummary);

        // Get the confusion matrix
        double[][] cmMatrix = eTest.confusionMatrix();
        for(int row_i=0; row_i<cmMatrix.length; row_i++){
            for(int col_i=0; col_i<cmMatrix.length; col_i++){
                System.out.print(cmMatrix[row_i][col_i]);
                System.out.print("|");
            }
            System.out.println();
        }
    }
}

And FYI, the crossValidationSplit method is here:

    public static Instances[][] crossValidationSplit(Instances data, int numberOfFolds) {
        Instances[][] split = new Instances[2][numberOfFolds];

        for (int i = 0; i < numberOfFolds; i++) {
            split[0][i] = data.trainCV(numberOfFolds, i);
            split[1][i] = data.testCV(numberOfFolds, i);
        }

        return split;
    }

In the end, I get 10 different results (because k = 10). One of them is:

  Correctly Classified Instances           4               36.3636 %
  Incorrectly Classified Instances         7               63.6364 %
  Kappa statistic                          0.0723
  Mean absolute error                      0.427 
  Root mean squared error                  0.5922
  Relative absolute error                 93.4946 %
  Root relative squared error            116.5458 %
  Total Number of Instances               11     

  2.0|0.0|1.0|
  1.0|1.0|2.0|
  3.0|0.0|1.0|

So, how can I interpret the results? Do you think I'm handling the training and test sets correctly? I want to obtain a given text file's sentiment percentages (positive, neutral, negative). How can I get that from these results? Thanks for reading...


Solution

  • Unfortunately your code is a bit confused.

    First of all, you train your model on your full dataset:

    classifier.buildClassifier(dataFiltered);
    

    then you retrain your model inside your for loop:

    for (int i = 0; i < trainingSets.length; i++) {
        classifier.buildClassifier(trainingSets[i]);
        ...
     }
    

    Then you calculate the confusion matrix by hand too. I think that is unnecessary.

    In my opinion you need to use the Evaluation.crossValidateModel() method, as follows:

    // set the class index
    dataFiltered.setClassIndex(dataFiltered.numAttributes() - 1);

    // build a model -- choose whichever classifier you want
    classifier.buildClassifier(dataFiltered);

    Evaluation eval = new Evaluation(dataFiltered);
    eval.crossValidateModel(classifier, dataFiltered, 10, new Random(1));

    // print the stats -- no need to compute the confusion matrix yourself, Weka does it!
    System.out.println(classifier);
    System.out.println(eval.toSummaryString());
    System.out.println(eval.toMatrixString());
    System.out.println(eval.toClassDetailsString());
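
    As for getting a text file's sentiment percentages: once the classifier is built, you can run it over the filtered instances and count the predicted labels. The snippet below is only a rough, untested sketch of that idea:

    // classify each instance and count how often each sentiment is predicted
    int numClasses = dataFiltered.numClasses();
    int[] counts = new int[numClasses];
    for (int i = 0; i < dataFiltered.numInstances(); i++) {
        int predicted = (int) classifier.classifyInstance(dataFiltered.instance(i));
        counts[predicted]++;
    }
    for (int c = 0; c < numClasses; c++) {
        double percent = 100.0 * counts[c] / dataFiltered.numInstances();
        System.out.println(dataFiltered.classAttribute().value(c) + ": " + percent + " %");
    }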