
Mahout 0.9: Using own test set instead of using split command


I followed these two guides to run the Mahout Naive Bayes (NB) classifier:

[1] http://tharindu-rusira.blogspot.com/2014/01/naive-bayes-classification-apache-mahout.html
[2] http://chimpler.wordpress.com/2013/03/13/using-the-mahout-naive-bayes-classifier-to-automatically-classify-twitter-messages/

I would like to use my own test set instead of having Mahout split my data into training and test sets (80:20). How can I achieve that?


Solution

  • Take two datasets: one for training and one for testing.

    Run the commands below on both sets:
    1. seqdirectory
    2. seq2sparse

    Now you will have vectors generated for both datasets.
    - Run the trainnb command on the first dataset's vectors. Instead of training the model on 80% of the data, we train it on the whole first dataset.
    - Run the testnb command on the second dataset's vectors. This is not a 20% slice of the training data; it is a completely new dataset used solely for testing.

    So instead of using mahout split, we supply our own dataset for testing the model.
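
The steps above can be sketched as a small shell script. This is a minimal sketch, not a definitive recipe: the directory names (train-data, test-data, nb-work) and the DRY_RUN switch are hypothetical, and the flags follow the usual Mahout 0.9 Naive Bayes examples. It assumes each dataset directory holds one subdirectory per category, so that trainnb can extract labels with -el. By default the script only prints the commands; set DRY_RUN=0 to actually run them.

```shell
#!/bin/sh
# Sketch: train on one dataset and test on a separate one (no `mahout split`).
# All paths below are hypothetical placeholders; adjust them to your setup.
MAHOUT="${MAHOUT:-mahout}"
DRY_RUN="${DRY_RUN:-1}"            # 1 = only print the commands, 0 = run them

TRAIN_DIR="${TRAIN_DIR:-train-data}"   # assumed layout: one subdirectory per category
TEST_DIR="${TEST_DIR:-test-data}"      # same layout, different documents
WORK="${WORK:-nb-work}"

# Print a mahout command and, unless in dry-run mode, execute it.
run() {
  echo "$MAHOUT $*"
  if [ "$DRY_RUN" = "0" ]; then
    "$MAHOUT" "$@"
  fi
}

# Steps 1 and 2 for one dataset: $1 = input directory, $2 = short name.
vectorize() {
  run seqdirectory -i "$1" -o "$WORK/$2-seq" -ow
  run seq2sparse -i "$WORK/$2-seq" -o "$WORK/$2-vectors" -lnorm -nv -wt tfidf
}

pipeline() {
  vectorize "$TRAIN_DIR" train
  vectorize "$TEST_DIR" test
  # Train on all vectors from the first dataset.
  run trainnb -i "$WORK/train-vectors/tfidf-vectors" -el \
      -o "$WORK/model" -li "$WORK/labelindex" -ow
  # Test on the vectors from the second, separate dataset.
  run testnb -i "$WORK/test-vectors/tfidf-vectors" \
      -m "$WORK/model" -l "$WORK/labelindex" -ow -o "$WORK/test-result"
}

pipeline
```

One thing worth verifying: seq2sparse builds its dictionary from whatever corpus it is given, so two separate runs may assign different indices to the same term. If accuracy on the test set looks implausibly low, that mismatch is a likely place to check.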