machine-learning, weka

How can I get Weka classifiers to use a lot less memory and CPU time?


I have a training set with 250,000 instances, which is too large for the Weka classifiers to handle. The data loads into the Weka UI just fine, but any attempt to run a non-trivial classifier results in an out-of-memory error, even with the machine's entire 8 GB of RAM dedicated to the JVM heap.

Because this involves geographical data, it should perform quite well if I cluster on latitude/longitude and train separate classifiers on each cluster.

Is there a way to do this easily on the Weka command-line or KnowledgeFlow, without having to mess with the ARFF file? (I prefer to keep a single large ARFF file so different split strategies can be evaluated within Weka)

I looked into Bagging and cross-validation, but I don't think they fit my problem: I don't want the data split up at random, but kept together based on similarity of location.


Solution

  • Adding these two options when training the classifier has a dramatic impact on performance, which obviates the need to split the dataset:

    -no-cv -v
    

    Training time for RandomForest, J48, and LWL drops to under two minutes; without these options the algorithms either failed to terminate (even after many hours) or ran out of memory. (`-no-cv` disables cross-validation and `-v` suppresses the evaluation statistics on the training data, both of which are expensive on a dataset this size.)
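    For example, a full training invocation with these options might look like the following sketch (the classifier choice, dataset path, model path, and heap size are assumptions; `weka.jar` must be on the classpath):

    ```shell
    REM Train J48 on the full dataset, skipping cross-validation (-no-cv)
    REM and training-set statistics (-v); save the model with -d.
    java -Xmx7g -cp weka.jar weka.classifiers.trees.J48 -t %TEMP%\StreetSet.arff -no-cv -v -d %TEMP%\j48.model
    ```

    The saved model can later be applied to a test set with `-T testfile.arff -l %TEMP%\j48.model`.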

    Previous answer based on file-splitting kept in case it helps someone with a REALLY large dataset:

    I have found a partial solution. The following Windows command lines will split the data in one ARFF file into ten separate ARFF files based on clustering (I'm using k-means because it runs very fast, but I will probably switch to EM or DBSCAN later):

    java -Xmx4096m -cp weka.jar weka.filters.unsupervised.attribute.AddCluster -i %TEMP%\StreetSet.arff -o \temp\clusters.arff -W "weka.clusterers.SimpleKMeans -N 10 -num-slots 4"

    for /l %i in (1,1,10) do java -Xmx4096m -cp weka.jar weka.filters.unsupervised.instance.RemoveWithValues -C last -L %i -V -i \temp\clusters.arff -o \temp\cluster%i.arff
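    The split files can then be used to train one classifier per cluster in the same loop style — a sketch under the same assumptions as above (the classifier and the model output paths are illustrative, not part of the original workflow):

    ```shell
    REM Train one J48 model per cluster file produced by the split above,
    REM reusing the -no-cv -v options to keep each run fast.
    for /l %i in (1,1,10) do java -Xmx4096m -cp weka.jar weka.classifiers.trees.J48 -t \temp\cluster%i.arff -no-cv -v -d \temp\cluster%i.model
    ```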

    This is not quite what I wanted: with this approach I can't use the Experimenter to try different parameter combinations, and it complicates the evaluation of new instances, which now have to go through separate command lines to be split by cluster first. I was hoping all of this could be handled transparently inside a Weka meta-classifier.