Tags: java, maven, weka, smote

Using SMOTE in Java raises "Comparison method violates its general contract"



I'm working on a Java project and I need to use Weka's API. I use Maven to manage dependencies and, in particular, I have the following one:
<dependency>
   <groupId>nz.ac.waikato.cms.weka</groupId>
   <artifactId>weka-stable</artifactId>
   <version>3.8.5</version>
</dependency>

This version does not include the SMOTE class, but I really need it; that's why I also added the following dependency to my pom.xml:

<dependency>
   <groupId>nz.ac.waikato.cms.weka</groupId>
   <artifactId>SMOTE</artifactId>
   <version>1.0.2</version>
</dependency>

In my Java code I'm also implementing the walk-forward validation technique: I prepare a training set and a testing set for each step, then use them in a loop that does the following:

for (...) {
   // wrap SMOTE and NaiveBayes so SMOTE is applied to the training data only
   var filtered = new FilteredClassifier();
   var smote = new SMOTE();
   filtered.setFilter(smote);
   filtered.setClassifier(new NaiveBayes());
   filtered.buildClassifier(trainingDataset);
   // evaluate the trained model on the current step's testing set
   var currEvaluation = new Evaluation(testingDataset);
   currEvaluation.evaluateModel(filtered, testingDataset);
}

Both trainingDataset and testingDataset are of type Instances, and their contents change appropriately at each iteration. In the first iteration no problem occurs, but in the second one java.lang.IllegalArgumentException: Comparison method violates its general contract! is raised. The exception stack trace is:

java.lang.IllegalArgumentException: Comparison method violates its general contract!
    at java.base/java.util.TimSort.mergeLo(TimSort.java:781)
    at java.base/java.util.TimSort.mergeAt(TimSort.java:518)
    at java.base/java.util.TimSort.mergeCollapse(TimSort.java:448)
    at java.base/java.util.TimSort.sort(TimSort.java:245)
    at java.base/java.util.Arrays.sort(Arrays.java:1441)
    at java.base/java.util.List.sort(List.java:506)
    at java.base/java.util.Collections.sort(Collections.java:179)
    at weka.filters.supervised.instance.SMOTE.doSMOTE(SMOTE.java:637)
    at weka.filters.supervised.instance.SMOTE.batchFinished(SMOTE.java:489)
    at weka.filters.Filter.useFilter(Filter.java:708)
    at weka.classifiers.meta.FilteredClassifier.setUp(FilteredClassifier.java:719)
    at weka.classifiers.meta.FilteredClassifier.buildClassifier(FilteredClassifier.java:794)
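
The sort that fails runs inside SMOTE's own doSMOTE, so the filter can be exercised on its own to narrow things down. A minimal sketch (assuming trainingDataset is the failing step's training set with its class index set; Filter.useFilter applies the filter outside FilteredClassifier):

    import weka.core.Instances;
    import weka.filters.Filter;
    import weka.filters.supervised.instance.SMOTE;

    // Sketch: apply SMOTE directly to the failing step's training data.
    // If this throws the same exception, the classifier is not involved.
    var smote = new SMOTE();
    smote.setInputFormat(trainingDataset); // required before useFilter
    Instances balanced = Filter.useFilter(trainingDataset, smote);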

Does anyone know how to solve the problem?
Thanks in advance.

EDIT: I forgot to say that I'm using Java 11.0.11.

EDIT 2: Based on @fracpete's answer, I suspect the problem may be in how the sets are created. For context, I'm trying to predict the bugginess of classes in another open-source project. Because of walk-forward, I have 19 steps and should have 19 different training files and 19 testing files. To avoid this, I have a list of InfoKeeper objects, each keeping the training and testing Instances for one step. While building this list, I do the following:

  1. From the base ARFF file, I create 2 temporary files: the training set file keeping version 1 data and the testing set file keeping version 2 data. Then I read these temp ARFFs to create the Instances objects. These will be kept by the InfoKeeper for step 1.
  2. I append the testing set file's rows (only the data, of course) to the training set file, so that it keeps version 1 and version 2 data. Then I overwrite the testing file so that it keeps the version 3 data. I read these temp ARFFs to get the Instances that will be kept by the InfoKeeper for step 2.

The code repeats step 2 to create all the remaining InfoKeepers. Could this operation be the problem? (An in-memory version of the split I'm aiming for is sketched below.)
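
For reference, the same walk-forward split could be done entirely in memory, without temporary files. A minimal sketch, assuming attribute 0 is the release index (walkForwardSplit is a hypothetical helper, not code from my project):

    import weka.core.Instances;

    // Sketch: train on releases 1..step, test on release step+1.
    static Instances[] walkForwardSplit(Instances totalData, int step) {
        Instances train = new Instances(totalData, 0); // empty copy, same header
        Instances test = new Instances(totalData, 0);
        for (int i = 0; i < totalData.numInstances(); i++) {
            int release = (int) totalData.instance(i).value(0);
            if (release <= step)
                train.add(totalData.instance(i));
            else if (release == step + 1)
                test.add(totalData.instance(i));
        }
        return new Instances[]{train, test};
    }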

I also tried @fracpete's snippet, but the same error occurs. The files I used are the following:
training set file
testing set file

EDIT 3: This is how I compute the files:

import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.logging.Level;
import java.util.logging.Logger;

import weka.core.Attribute;
import weka.core.Instance;
import weka.core.Instances;
import weka.core.converters.CSVLoader;

public class FilesCreator {

    private File basicArff;
    private Instances totalData;

    private ArrayList<Instance> testingInstances;
    private File testingSet;
    private File trainingSet;

    /* *******************************************************************/

    public FilesCreator(File csvFile, File arffFile, File training, File testing) 
           throws IOException {
        var loader = new CSVLoader();
        loader.setSource(csvFile);
        this.totalData = loader.getDataSet(); // get instances object
        this.basicArff = arffFile;
        this.testingSet = testing;
        this.trainingSet = training;
    }

    private ArrayList<Attribute> getAttributesList(){
        var attributes = new ArrayList<Attribute>();
        int i;
        for (i = 0; i < this.totalData.numAttributes(); i++)
            attributes.add(this.totalData.attribute(i));
        return attributes;
    }

    private void writeHeader(PrintWriter pf) {
        // just write the header attributes to the given file;
        // pf writes to either this.testingSet or this.trainingSet
        pf.append("@relation " + this.totalData.relationName() + "\n\n");
        pf.flush();
        var attributes = this.getAttributesList();
        for (Attribute line : attributes){
            pf.append(line.toString() + "\n");
            pf.flush();
        }
        pf.append("\n@data\n");
        pf.flush();
    }

    /* *******************************************************************/
    /* testing file */

    // testing instances
    private void computeTestingSet(int indexRelease){
        int i;
        int currIndex;
        // re-initialize the list
        this.testingInstances = new ArrayList<>();
        for (i = 0; i < this.totalData.numInstances(); i++){
            // first attribute is the release index
            currIndex = (int) this.totalData.instance(i).value(0);
            if (currIndex == indexRelease)
                testingInstances.add(this.totalData.instance(i));
            else if (currIndex > indexRelease)
                break;
        }
    }

    // testing file
    private void computeTestingFile(int indexRelease){
        this.computeTestingSet(indexRelease);
        try(var fp = new PrintWriter(this.testingSet)) {

            this.writeHeader(fp);
            for (Instance line : this.testingInstances){
                fp.append(line.toString() + "\n");
                fp.flush();
            }
        } catch (IOException e) {
            var logger = Logger.getLogger(FilesCreator.class.getName());
            logger.log(Level.OFF, Arrays.toString(e.getStackTrace()));
        }
    }
    /* *******************************************************************/
    // training file
    private void computeTrainingFile(int indexRelease){
        int i;

        try(var fw = new FileWriter(this.trainingSet, true);
            var fp = new PrintWriter(fw)) {
            if (indexRelease == 1) {
                // first iteration: the file is assumed empty, so write the header first
                this.writeHeader(fp);
                for (i = 0; i < this.totalData.numInstances(); i++) {
                    if ( (int) this.totalData.instance(i).value(0) > indexRelease)
                        break;
                    fp.append(this.totalData.instance(i).toString() + "\n");
                    fp.flush();
                }
            }
            else {
                // in this case just append the testing instances, which
                // are the indexRelease+1-th data:
                for (Instance obj : this.testingInstances){
                    fp.append(obj.toString() + "\n");
                    fp.flush();
                }
            }
        } catch (IOException e) {
            var logger = Logger.getLogger(FilesCreator.class.getName());
            logger.log(Level.OFF, Arrays.toString(e.getStackTrace()));
        }
    }

    /* *******************************************************************/
    // public method
    public void computeFiles(int indexRelease){
        this.computeTrainingFile(indexRelease);
        this.computeTestingFile(indexRelease + 1);
    }
}

The last public method is invoked inside a loop in another class, with i going from 1 to 19:

    FilesCreator filesCreator = new FilesCreator(csvFile, arffFile, training, testing);
    for (int i = 1; i < 20; i++) {
        filesCreator.computeFiles(i);
        /* do something with the files, such as getting the Instances
           and using them for the SMOTE computation */
    }
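
Inside that loop, "getting Instances" means reading the freshly written files back. That part looks roughly like this (a sketch; I assume the class attribute is the last one, and setClassIndex must be called before SMOTE or NaiveBayes can run):

    import weka.core.Instances;
    import weka.core.converters.ArffLoader;

    // Sketch: load the files just written back into Weka Instances.
    var trainLoader = new ArffLoader();
    trainLoader.setSource(training);
    Instances trainingDataset = trainLoader.getDataSet();
    trainingDataset.setClassIndex(trainingDataset.numAttributes() - 1);

    var testLoader = new ArffLoader();
    testLoader.setSource(testing);
    Instances testingDataset = testLoader.getDataSet();
    testingDataset.setClassIndex(testingDataset.numAttributes() - 1);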

EDIT 4: I removed duplicate instances from totalData in FilesCreator by doing the following:

    var currDir = Paths.get(".").toAbsolutePath().normalize().toFile();
    var ext = ".arff";
    var tmpFile = File.createTempFile("without_replicated", ext, currDir);
    RemoveDuplicates.main(new String[]{"-i", this.basicArff.toPath().toString(), "-o", tmpFile.toPath().toString()});
    // the output file contains no repeated instances
    var arffLoader = new ArffLoader();
    arffLoader.setSource(tmpFile);
    this.totalData = arffLoader.getDataSet();
    Files.delete(tmpFile.toPath());
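
The same deduplication could also be done in memory, which would avoid the temporary file round-trip entirely. A sketch using the filter API instead of its main method:

    import weka.filters.Filter;
    import weka.filters.unsupervised.instance.RemoveDuplicates;

    // Sketch: apply RemoveDuplicates directly to the loaded dataset.
    var dedup = new RemoveDuplicates();
    dedup.setInputFormat(this.totalData);
    this.totalData = Filter.useFilter(this.totalData, dedup);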

I cannot modify it manually because it's the output of a previous computation. The code works for iteration 2, but raises the same error in iteration 3.
The files for this iteration are:
train_iteration4.arff
test_iteration4.arff
This is the full ARFF file obtained by the previous snippet; it's the one loaded by arffLoader.setSource(tmpFile):
full.arff


Solution

  • I solved the problem by changing the SMOTE dependency in my pom.xml to:

    <dependency>
       <groupId>nz.ac.waikato.cms.weka</groupId>
       <artifactId>SMOTE</artifactId>
       <version>1.0.3</version>
    </dependency>
    

    With this version I no longer get the exception and my code runs as expected; presumably 1.0.3 fixes the comparator that TimSort was complaining about. I hope this helps others.