I have some texts that I would like to mine by applying machine learning methods in Java with the Weka libraries. I have already written quite a bit of code, but since the whole thing is too long, I just want to show some key methods and get an idea of how to train and test my dataset, interpret the results, and so on.
FYI, I am processing tweets with Twitter4J.
First, I fetched the tweets and saved them to a text file (in ARFF format, of course). Then I manually labeled them with their sentiment (positive, neutral, negative). Based on the selected classifier, I created test sets from my training set via cross-validation. Finally, I classified them and printed the summary and the confusion matrix.
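For reference, a labeled-tweet ARFF file of the kind described here might look like the sketch below (the attribute names and sample rows are illustrative, not the actual data):

```
@relation tweets

@attribute text string
@attribute sentiment {positive,neutral,negative}

@data
'I love this phone',positive
'Battery is ok I guess',neutral
'Worst update ever',negative
```

The class attribute comes last, which is what `setClassIndex(data.numAttributes() - 1)` in the code below assumes.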
Here is one of my classifiers, the Naive Bayes code:
public static void ApplyNaiveBayes(Instances data) throws Exception {
    System.out.println("Applying Naive Bayes\n");
    data.setClassIndex(data.numAttributes() - 1);

    // Turn the string attribute into a bag-of-words representation
    StringToWordVector swv = new StringToWordVector();
    swv.setInputFormat(data);
    Instances dataFiltered = Filter.useFilter(data, swv);
    System.out.println("\n\nFiltered data:\n\n" + dataFiltered);

    Instances[][] split = crossValidationSplit(dataFiltered, 10);
    Instances[] trainingSets = split[0];
    Instances[] testingSets = split[1];

    NaiveBayes classifier = new NaiveBayes();
    classifier.buildClassifier(dataFiltered);
    System.out.println("\n\nClassifier model:\n\n" + classifier);

    // Test the model on each fold
    for (int i = 0; i < trainingSets.length; i++) {
        classifier.buildClassifier(trainingSets[i]);
        Evaluation eTest = new Evaluation(trainingSets[i]);
        eTest.evaluateModel(classifier, testingSets[i]);

        // Print the evaluation summary
        System.out.println(eTest.toSummaryString());

        // Print the confusion matrix
        double[][] cmMatrix = eTest.confusionMatrix();
        for (int row = 0; row < cmMatrix.length; row++) {
            for (int col = 0; col < cmMatrix[row].length; col++) {
                System.out.print(cmMatrix[row][col]);
                System.out.print("|");
            }
            System.out.println();
        }
    }
}
And FYI, the crossValidationSplit method is here:
public static Instances[][] crossValidationSplit(Instances data, int numberOfFolds) {
    Instances[][] split = new Instances[2][numberOfFolds];
    for (int i = 0; i < numberOfFolds; i++) {
        split[0][i] = data.trainCV(numberOfFolds, i);
        split[1][i] = data.testCV(numberOfFolds, i);
    }
    return split;
}
In the end, I get 10 different results (because k = 10). One of them is:
Correctly Classified Instances           4               36.3636 %
Incorrectly Classified Instances         7               63.6364 %
Kappa statistic                          0.0723
Mean absolute error                      0.427
Root mean squared error                  0.5922
Relative absolute error                 93.4946 %
Root relative squared error            116.5458 %
Total Number of Instances               11
2.0|0.0|1.0|
1.0|1.0|2.0|
3.0|0.0|1.0|
So, how can I interpret these results? Do you think I am handling the training and test sets correctly? I want to obtain the sentiment percentages (positive, neutral, negative) of a given text file. How can I infer that from these results? Thanks for reading...
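On interpreting the numbers: in a confusion matrix, row i holds the actual class and column j the predicted class, so the diagonal counts the correct predictions. In the matrix above, 2 + 1 + 1 = 4 of 11 instances sit on the diagonal, which matches the 36.36 % accuracy in the summary. A small stand-alone check of that arithmetic (plain Java, no Weka required; the label names are assumed from the question):

```java
public class ConfusionMatrixStats {
    public static void main(String[] args) {
        // The 3x3 matrix from the question: rows = actual, columns = predicted
        double[][] cm = {
            {2, 0, 1},   // actual positive
            {1, 1, 2},   // actual neutral
            {3, 0, 1}    // actual negative
        };
        String[] labels = {"positive", "neutral", "negative"};

        double total = 0, correct = 0;
        for (int i = 0; i < cm.length; i++) {
            for (int j = 0; j < cm[i].length; j++) {
                total += cm[i][j];
            }
            correct += cm[i][i];   // diagonal = correctly classified
        }
        System.out.printf("Accuracy: %.4f%n", correct / total); // 4/11 ≈ 0.3636

        // Per-class recall: diagonal entry divided by the row sum
        for (int i = 0; i < cm.length; i++) {
            double rowSum = 0;
            for (int j = 0; j < cm[i].length; j++) rowSum += cm[i][j];
            System.out.printf("Recall(%s): %.4f%n", labels[i], cm[i][i] / rowSum);
        }
    }
}
```

A low recall for a class (here negative: 1 of 4) tells you which sentiment the model is failing to recognize, which raw accuracy alone hides.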
Unfortunately, your code is a bit confused.
First of all, you train your model on the full dataset:
classifier.buildClassifier(dataFiltered);
then you retrain your model inside your for loop:
for (int i = 0; i < trainingSets.length; i++) {
    classifier.buildClassifier(trainingSets[i]);
    ...
}
Then you calculate the confusion matrix by hand, which I think is unnecessary. In my opinion, you should apply the Evaluation.crossValidateModel() method as follows:
// set the class index
dataFiltered.setClassIndex(dataFiltered.numAttributes() - 1);

// build a model -- choose whichever classifier you want
classifier.buildClassifier(dataFiltered);

Evaluation eval = new Evaluation(dataFiltered);
eval.crossValidateModel(classifier, dataFiltered, 10, new Random(1));

// print the stats -- no need to compute the confusion matrix yourself, Weka does it!
System.out.println(classifier);
System.out.println(eval.toSummaryString());
System.out.println(eval.toMatrixString());
System.out.println(eval.toClassDetailsString());
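As for the sentiment percentages you want: cross-validation only measures model quality. To score a new file of tweets, you classify each (unlabeled) instance and count the predicted labels. In Weka, each label comes from `classifier.classifyInstance(instance)`, mapped to a string via `data.classAttribute().value((int) prediction)`. A minimal sketch of the counting step alone (plain Java; the sample predictions are made up for illustration):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class SentimentPercent {
    public static void main(String[] args) {
        // One predicted label per tweet -- in practice each entry would be
        // produced by classifier.classifyInstance(...) on a test instance.
        String[] predicted = {"positive", "negative", "neutral",
                              "positive", "positive", "negative"};

        // Count how often each label was predicted
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String label : predicted) {
            counts.merge(label, 1, Integer::sum);
        }

        // Convert counts to percentages of the whole file
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            double percent = 100.0 * e.getValue() / predicted.length;
            System.out.printf("%s: %.1f%%%n", e.getKey(), percent);
        }
        // prints: positive: 50.0%, negative: 33.3%, neutral: 16.7%
    }
}
```

That gives you exactly the "X % positive, Y % neutral, Z % negative" summary for the file, independent of which Weka classifier produced the labels.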