Weka CSVSaver indexing issue


I am using Weka to implement a bunch of NLP algorithms. For this, I want to write the datasets I create (from plain texts) to csv files. The instances are created properly. I have tested the instance creation process by manually checking very small parts of the dataset (e.g. just two texts with 10 words each). I have also used Weka's k-means clusterer directly on the instances I create, and it runs flawlessly.

But when I try to use CSVSaver to save the instances to a file, I get an IndexOutOfBoundsException. As far as I can see, both Saver#writeBatch() and Saver#writeIncremental() loop all the way up to and including the length of the instance. That baffles me! Java is 0-indexed, and an Instance object is also 0-indexed. So why is Weka looping until size() and not size() - 1? Am I missing something extremely obvious here?

The relevant portion of the code is as follows:

CSVSaver csvSaver = new CSVSaver();
csvSaver.setFieldSeparator("\t");
csvSaver.setFile(new File(optionSet.valueOf("doc-output").toString()));
csvSaver.setMaxDecimalPlaces(3);
csvSaver.setNoHeaderRow(false);
csvSaver.setInstances(documentInstances);
csvSaver.setRetrieval(AbstractSaver.INCREMENTAL);
for (Instance instance : csvSaver.getInstances())
    csvSaver.writeIncremental(instance);

The very first iteration of the for loop writes the header row, which contains 346 elements (indexed from 0 to 345). Weka writes all of them, and then throws the following exception:

java.lang.IndexOutOfBoundsException: Index: 346, Size: 346
    at java.util.ArrayList.rangeCheck(ArrayList.java:635)
    at java.util.ArrayList.get(ArrayList.java:411)
    at weka.core.Instances.attribute(Instances.java:341)
    at weka.core.AbstractInstance.toString(AbstractInstance.java:744)
    at weka.core.converters.CSVSaver.instanceToString(CSVSaver.java:578)
    at weka.core.converters.CSVSaver.writeIncremental(CSVSaver.java:472)

Why is Weka going all the way to index 346, when even a Java beginner knows to stop at 345?


Solution

  • I managed to find a way around this by forcing each instance to be a DenseInstance as follows:

    for (Instance instance : csvSaver.getInstances()) {
        csvSaver.writeIncremental(new DenseInstance(instance));
    }
    

    This works perfectly, and the CSV output is correct.

    It is only a workaround, though, and I would still like someone to find the real reason behind this error.
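
One thing the stack trace does show is that the failing call is the header lookup (Instances.attribute, backed by an ArrayList), reached from AbstractInstance.toString. The following is a plain-Java sketch, not Weka code, of how a perfectly ordinary 0-based loop can still ask for index 346 of a 346-element list: the loop runs over the values of a row, while the lookup goes against a separate header list. Whether the instances here really hold one more value than the header has attributes is an assumption that would need checking (for example by comparing instance.numAttributes() with the dataset's numAttributes()). The class and method names below are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for "header + row" bookkeeping; this is NOT Weka's implementation.
final class HeaderLookupSketch {

    /** Builds a header with the given number of column names (stand-in for a dataset's attribute list). */
    static List<String> buildHeader(int columns) {
        List<String> header = new ArrayList<>();
        for (int i = 0; i < columns; i++) {
            header.add("attr" + i);
        }
        return header;
    }

    /**
     * Walks the row with a correct 0-based loop and looks up the matching header entry.
     * Returns the first index whose header lookup fails, or -1 if every value has a header entry.
     */
    static int firstFailingIndex(List<String> header, double[] row) {
        for (int i = 0; i < row.length; i++) {   // bounded by the row, not by the header
            try {
                header.get(i);                   // lookup against the header list
            } catch (IndexOutOfBoundsException e) {
                return i;                        // the header is shorter than the row
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        List<String> header = buildHeader(346);
        System.out.println(firstFailingIndex(header, new double[346])); // prints -1
        System.out.println(firstFailingIndex(header, new double[347])); // prints 346
    }
}
```

If the sizes do turn out to differ, the loop bound itself would not be the bug; the mismatch between the instances and their header would be.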