
How to use created model with new data in Weka


I'm trying some tests with Weka. I hope someone can help me, and that I can make myself clear.

Step 1: Tokenize my data

@attribute text string
@attribute @@class@@ {derrota,empate,vitoria}

@data
'O Grêmio perdeu para o Cruzeiro por 1 a 0',derrota
'O Grêmio venceu o Palmeiras em um grande jogo de futebol, nesta quarta-feira na Arena',vitoria

Step 2: Build model on tokenized data

After loading this I apply a StringToWordVector filter. After applying the filter I save a new ARFF file with the words tokenized, something like:

@attribute @@class@@ {derrota,empate,vitoria}
@attribute o numeric
@attribute grêmio numeric
@attribute perdeu numeric
@attribute venceu numeric
@ and so on .....

@data
{0 derrota, 1 1, 2 1, 3 1, 4 0, ...}
{0 vitoria, 1 1, 2 1, 3 0, 4 1, ...}

OK! Now, based on this ARFF, I build my classifier model and save it.
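
For reference, programmatically this step would look roughly like the following (a simplified sketch; file names are just placeholders):

    import java.io.File;
    import weka.core.Instances;
    import weka.core.converters.ArffSaver;
    import weka.core.converters.ConverterUtils.DataSource;
    import weka.filters.Filter;
    import weka.filters.unsupervised.attribute.StringToWordVector;

    // load the raw ARFF (text string + class) and mark the class attribute
    Instances raw = new DataSource("training.arff").getDataSet();
    raw.setClassIndex(raw.numAttributes() - 1);

    // turn the string attribute into word attributes
    StringToWordVector filter = new StringToWordVector();
    filter.setInputFormat(raw);
    Instances tokenized = Filter.useFilter(raw, filter);

    // save the tokenized data as a new ARFF file
    ArffSaver saver = new ArffSaver();
    saver.setInstances(tokenized);
    saver.setFile(new File("training_tokenized.arff"));
    saver.writeBatch();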

Step 3: Test with "simulated new data"

If I want to test my model with "simulated new data", what I actually do is edit this last ARFF and add a line like:

{0 ?, 1 1, 2 1, 3 1, 4 0, ...}

Step 4(my problem): How to test with really new data

So far so good. My problem is when I need to use this model with really new data. For example, suppose I have the string "O Grêmio caiu diante do Palmeiras". It contains 4 new words that don't exist in my model and 2 that do.

How can I create an ARFF file with this new data that fits my model? (OK, I know the 4 new words will not be present, but how can I work with this?)

After supplying a different test set, the following message appears:

(screenshot of the error message)


Solution

  • If you use Weka programmatically then you can do this fairly easily.

    • Create your training file (e.g. training.arff)
    • Create Instances from the training file: Instances trainingData = ...
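
    sample code (one way to load the ARFF into Instances; a sketch using the training.arff from the step above):

        import weka.core.Instances;
        import weka.core.converters.ConverterUtils.DataSource;

        Instances trainingData = new DataSource("training.arff").getDataSet();
        // the class attribute is the last one in the raw training file
        trainingData.setClassIndex(trainingData.numAttributes() - 1);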
    • Use StringToWordVector to transform your string attributes into a numeric representation:

    sample code:

        // useIdf, minTermFreq, maxGrams, minGrams and useStemmer below are your own configuration values
        StringToWordVector filter = new StringToWordVector();
        filter.setWordsToKeep(1000000);
        if(useIdf){
            filter.setIDFTransform(true);
        }
        filter.setTFTransform(true);
        filter.setLowerCaseTokens(true);
        filter.setOutputWordCounts(true);
        filter.setMinTermFreq(minTermFreq);
        filter.setNormalizeDocLength(new SelectedTag(StringToWordVector.FILTER_NORMALIZE_ALL,StringToWordVector.TAGS_FILTER));
        NGramTokenizer t = new NGramTokenizer();
        t.setNGramMaxSize(maxGrams);
        t.setNGramMinSize(minGrams);    
        filter.setTokenizer(t);  
        WordsFromFile stopwords = new WordsFromFile();
        stopwords.setStopwords(new File("data/stopwords/stopwords.txt"));
        filter.setStopwordsHandler(stopwords);
        if (useStemmer){
            Stemmer s = new /*Iterated*/LovinsStemmer();
            filter.setStemmer(s);
        }
        // determine the dictionary / output format from the training data
        filter.setInputFormat(trainingData);
    
    • Apply the filter to trainingData: trainingData = Filter.useFilter(trainingData, filter);

    • Select a classifier to create your model

    sample code for the LibLINEAR classifier

            Classifier cls = null;
            LibLINEAR liblinear = new LibLINEAR();
            liblinear.setSVMType(new SelectedTag(0, LibLINEAR.TAGS_SVMTYPE));
            liblinear.setProbabilityEstimates(true);
            // liblinear.setBias(1); // default value
            cls = liblinear;
            cls.buildClassifier(trainingData);
    
    • Save model

    sample code

        System.out.println("Saving the model...");
        ObjectOutputStream oos;
        oos = new ObjectOutputStream(new FileOutputStream(path+"mymodel.model"));
        oos.writeObject(cls);
        oos.flush();
        oos.close();
    
    • Create a testing file (e.g. testing.arff)
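
    sample testing.arff (a sketch; it keeps the raw string format of the training file, and the class can be left as '?'):

        @relation test

        @attribute text string
        @attribute @@class@@ {derrota,empate,vitoria}

        @data
        'O Grêmio caiu diante do Palmeiras',?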

    • Create Instances from the testing file: Instances testingData = ...

    • Load classifier

    sample code

    Classifier myCls = (Classifier) weka.core.SerializationHelper.read(path+"mymodel.model");
    
    • Use the same StringToWordVector filter as above, or create a new one for testingData, but remember to call filter.setInputFormat(trainingData); with the training data. This keeps the format of the training set and will not add words that are not in the training set.

    • Apply the filter to testingData: testingData = Filter.useFilter(testingData, filter);
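
    sample code recapping the filter handling for both sets (same filter object; a sketch assuming trainingData and testingData were loaded as above):

        // the word dictionary is learned from the training data only
        filter.setInputFormat(trainingData);
        trainingData = Filter.useFilter(trainingData, filter);

        // the same filter maps the test strings onto the training vocabulary;
        // words that never appeared in training are simply ignored
        testingData = Filter.useFilter(testingData, filter);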

    • Classify!

    sample code

     // assumes the class index is set on testingData
     for (int j = 0; j < testingData.numInstances(); j++) {
         double res = myCls.classifyInstance(testingData.get(j));
         // map the predicted index back to its class label, e.g. "derrota"
         System.out.println(testingData.classAttribute().value((int) res));
     }
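
    Since probability estimates are enabled on LibLINEAR above, myCls.distributionForInstance(testingData.get(j)) can also be used to obtain a probability for each class instead of just the predicted label.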
    

    1. Not sure if this can be done through the GUI.
    2. Save and load steps are optional.

    Edit: after some digging in the Weka GUI, I think it is possible to do it there. In the Classify tab, set your testing set in the Supply test set field. At this point your sets will normally be reported as incompatible; to fix this, click Yes in the following dialog

    (screenshot of the compatibility dialog)

    and you are good to go.