I have a question about Mahout: why do I get the same test results (the same model test accuracy, 80%) in the confusion matrix when I test my trained Naive Bayes model with the complementary approach and with the standard approach?
Here are the steps I used:
# mahout seq2sparse --input /user/root/data-seq/chunk-0 --output /user/root/vectors -ow -wt tfidf -md 2 -x 95 -n 2 -nr 2
# mahout split --input data-vectors/tfidf-vectors --trainingOutput training-vectors --testOutput test-vectors --randomSelectionPct 30 --overwrite --sequenceFiles -xm sequential
a) ComplementaryNaiveBayesClassifier: # mahout trainnb -i training-vectors -el -li labelindex -o model -ow -c
b) StandardNaiveBayesClassifier: # mahout trainnb -i training-vectors -el -li labelindex -o model -ow
a) ComplementaryNaiveBayesClassifier: # mahout testnb -i training-vectors -m model -l labelindex -ow -o tweets-testing -c
b) StandardNaiveBayesClassifier: # mahout testnb -i training-vectors -m model -l labelindex -ow -o tweets-testing
Maybe it is because Standard Naive Bayes does not use weight normalization, but I used it in the first step by setting the parameter -n 2? If that is true, does it mean I should not use this parameter when creating the vectors if I want to compare the performance of these algorithms?
The -n 2 option that you're referring to for mahout seq2sparse actually specifies the L_p norm to use for length normalization [1]. So mahout seq2sparse ... -n 2 uses L_2 length normalization of the TF-IDF vectors. Alternatively, you could use -lnorm for log normalization. This is part of the preprocessing step that is applied before both Complement and Standard Naive Bayes [2].
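To make the idea of length normalization concrete, here is a minimal sketch in Python. It is not Mahout's code; the function name lp_normalize and the sample vector are made up for illustration. It only shows what dividing a TF-IDF vector by its L_p norm does, with p = 2 corresponding to -n 2.

```python
def lp_normalize(vector, p=2.0):
    """Divide every component by the vector's L_p norm.

    Illustrative helper only (hypothetical name, not a Mahout API).
    """
    norm = sum(abs(x) ** p for x in vector) ** (1.0 / p)
    if norm == 0.0:
        # An all-zero document vector has nothing to scale.
        return list(vector)
    return [x / norm for x in vector]


# Hypothetical TF-IDF weights for one document.
tfidf = [3.0, 4.0, 0.0]
normalized = lp_normalize(tfidf, p=2)
print(normalized)  # the L_2 norm of [3, 4, 0] is 5, so this prints [0.6, 0.8, 0.0]
```

Note that this rescales each document vector on its own, before any classifier is trained, which is why it affects both classifiers in the same way.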
Weight normalization is different from length normalization and is not used in Mahout 0.7.
Weight normalization is used in the upcoming 1.0 release, so to get the best comparison of Standard and Complement Naive Bayes you should check out and build a copy of the latest trunk: http://mahout.apache.org/developers/buildingmahout.html.
You should see a significant difference between Standard and Complement Naive Bayes if you upgrade to the latest trunk.
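For contrast with length normalization, here is a rough Python sketch of weight normalization as it is commonly described for Complement Naive Bayes: each class's vector of learned log weights is divided by the sum of its absolute values. This is an assumption about the general technique, not a copy of what Mahout does internally; the function name normalize_weights and the example weights are hypothetical.

```python
def normalize_weights(weights):
    """Scale one class's weight vector so its absolute values sum to 1.

    Illustrative helper only (hypothetical name, not a Mahout API).
    """
    total = sum(abs(w) for w in weights)
    if total == 0.0:
        # Nothing to scale if every weight is zero.
        return list(weights)
    return [w / total for w in weights]


# Hypothetical per-class log weights learned during training, one entry per term.
class_weights = {
    "positive": [-1.0, -3.0],
    "negative": [-2.0, -2.0],
}

normalized = {label: normalize_weights(w) for label, w in class_weights.items()}
print(normalized["positive"])  # [-0.25, -0.75]
print(normalized["negative"])  # [-0.5, -0.5]
```

The point of the sketch is that this step acts on the trained model's weights per class, whereas -n acts on the input document vectors, so changing -n does not switch weight normalization on or off.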
[1] http://mahout.apache.org/users/basics/creating-vectors-from-text.html
[2] http://mahout.apache.org/users/classification/bayesian.html