Tags: weka, text-classification

How can I use my text classifier in practice? As in, getting the tf-idf values of new comments


Let's say I am building a classifier in Java that classifies comments as spam or not spam. The data set is simple: it has two attributes, the string comment and the nominal class.

Now I need to filter my training data set with the StringToWordVector filter. My first problem is with the test data set: if it is filtered on its own, its attributes will be different from those of the training set. I researched and found that I can use batch filtering, like:

    StringToWordVector filter = new StringToWordVector();
    //Here I will set the options; I would be using tf-idf and some others
    filter.setInputFormat(TrainingData);
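
From what I found, the full batch-filtering pattern (using Filter.useFilter; variable names are just placeholders) would look roughly like this:

    // needs weka.core.Instances, weka.filters.Filter,
    // weka.filters.unsupervised.attribute.StringToWordVector
    StringToWordVector filter = new StringToWordVector();
    filter.setTFTransform(true);
    filter.setIDFTransform(true);
    //...other options

    filter.setInputFormat(trainingData);                                // dictionary comes from the training set
    Instances filteredTrain = Filter.useFilter(trainingData, filter);   // first batch
    Instances filteredTest  = Filter.useFilter(testData, filter);       // second batch reuses the same dictionary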

Now, is this approach correct? If I use this filter, both data sets should be compatible, but will they be filtered in the correct way? I am afraid the tf-idf values of the test data would be affected in a way that reduces the accuracy.

Now to my main question: how can I use my classifier in practice? In practice I am going to get a single comment as a string. I figured I would make it an instance, but how would I filter it to get the tf-idf values needed to classify it? Maybe I could add the comment to the original training data set and recalculate the tf-idf every time, but is that how it is done in practice?


Solution

  • I am attempting to answer the question using a different text classification task than spam classification.

    Say, I have the following training data:

    "The US government had imposed extra taxes on crude oil", petrolium
    "The German manufacturers are observing different genes of Canola oil", non-petrolium
    

    And the following test data:

    "Canada is famous for producing quality corn oil", ?
    

    Now, consider you are going to use Naive Bayes with the StringToWordVector filter. If you apply the filter to the training and test data separately, you will get two very different word vectors: every term in each set becomes a feature of that set, so the attribute sets do not match and you will get an error like "Training and test data are not compatible". The solution is to use a FilteredClassifier, which takes both the classifier of your choice (in our case Naive Bayes) and the filter (in our case StringToWordVector), and applies the filter learned from the training data to the test instances as well. You will need something similar to what follows:

    private NaiveBayes nb;
    private FilteredClassifier fc;
    private StringToWordVector filter;
    private double[] clsLabel;
    
    // Set the filter--->
    filter = new StringToWordVector();
    filter.setTokenizer(tokenizer);              // e.g. a weka.core.tokenizers.WordTokenizer configured elsewhere
    filter.setWordsToKeep(1000000);
    filter.setDoNotOperateOnPerClassBasis(true);
    filter.setLowerCaseTokens(true);
    filter.setTFTransform(true);
    filter.setIDFTransform(true);
    filter.setStopwords(stopwords);              // a java.io.File of stop words (older Weka API; newer versions use setStopwordsHandler)

    filter.setInputFormat(trainingData);         // trainingData: the training Instances, class index already set
    //<---setting of filter ends
    
    //setting the classifier--->
    fc = new FilteredClassifier();
    nb = new NaiveBayes();      
    fc.setFilter(filter);
    fc.setClassifier(nb);
    //<---setting of the classifier ends
    
    fc.buildClassifier(trainingData);
    
    //Classification--->
    clsLabel = new double[testData.numInstances()];   // holds the predicted class label of each test document
    //for each test document--->
    for (int i = 0; i < testData.numInstances(); i++) {
        try {
            clsLabel[i] = fc.classifyInstance(testData.instance(i));
        } catch (Exception e) {
            System.out.println("Error from Classification.classify(). Cannot classify instance");
        }
        testData.instance(i).setClassValue(clsLabel[i]);
    }//end for
    //<---classification ends
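
    If the test documents already carry gold labels, you can also let Weka compute accuracy and related statistics instead of assigning the predicted labels by hand; a minimal sketch using weka.classifiers.Evaluation:

    Evaluation eval = new Evaluation(trainingData);   // needs weka.classifiers.Evaluation
    eval.evaluateModel(fc, testData);                 // runs the trained FilteredClassifier over the labelled test set
    System.out.println(eval.toSummaryString());       // accuracy, error counts, etc.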
    

    NB. The FilteredClassifier builds the StringToWordVector from the training data only and then applies that same filter (dictionary and document frequencies included) to each test instance, so the TF-IDF values of the training and test data are computed on a compatible basis.
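
    To answer the "in practice" part of the question: for a single new comment you do not add it to the training data and recompute TF-IDF; you keep the trained FilteredClassifier and build a one-row data set that shares the training header. A minimal sketch, assuming the layout above (string comment first, nominal class second; DenseInstance is the Weka >= 3.7 class):

    // needs weka.core.Instances and weka.core.DenseInstance
    Instances unlabeled = new Instances(trainingData, 0);               // copy the training header, no rows
    unlabeled.setClassIndex(unlabeled.numAttributes() - 1);

    DenseInstance inst = new DenseInstance(unlabeled.numAttributes());  // all values missing, incl. the class
    inst.setDataset(unlabeled);
    inst.setValue(unlabeled.attribute(0), "Is corn oil cheaper than crude oil?");  // the new comment text
    unlabeled.add(inst);

    // the FilteredClassifier converts the string with the filter built on the training data
    double label = fc.classifyInstance(unlabeled.instance(0));
    System.out.println(unlabeled.classAttribute().value((int) label));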