I have a training set with 250,000 instances, which is too large for Weka classifiers to handle (the data loads into the Weka UI just fine, but any attempt to run a non-trivial classifier results in an out-of-memory error, even with the machine's entire 8GB of RAM dedicated to the JVM heap).
Because this is geographical data, I expect classification to perform quite well if I cluster on latitude/longitude and train a separate classifier on each cluster.
Is there a way to do this easily on the Weka command line or in KnowledgeFlow, without having to mess with the ARFF file? (I'd prefer to keep a single large ARFF file so that different split strategies can be evaluated within Weka.)
I looked into Bagging and cross-validation, but I don't think they fit my problem: I don't want the data split up at random, but kept together based on similarity of location.
Adding these two options when training the classifier has a dramatic impact on performance and obviates the need to split the dataset:
-no-cv -v
Training time for RandomForest, J48, and LWL drops to under 2 minutes; without these options the same runs either failed to terminate (after many hours) or ran out of memory.
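These flags tell Weka to skip the evaluation it normally runs after building the model: -no-cv disables the default 10-fold cross-validation and -v suppresses the statistics on the training data. For reference, a complete training run then looks something like the line below (the heap size, file paths, and choice of RandomForest are just placeholders for my setup):

java -Xmx7168m -cp weka.jar weka.classifiers.trees.RandomForest -no-cv -v -t \temp\StreetSet.arff -d \temp\StreetSet-rf.model

The -d option saves the built model so it can be applied to new instances later with -l.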
Previous answer based on file-splitting kept in case it helps someone with a REALLY large dataset:
I have found a partial solution. The following command lines (on Windows) will split the data in one ARFF file into ten separate ARFF files based on clustering: the first appends a cluster attribute to every instance (I'm using k-means because it runs very fast, but will probably switch to EM or DBSCAN later), and the second writes each cluster's instances out to its own file:
java -Xmx4096m -cp weka.jar weka.filters.unsupervised.attribute.AddCluster -i %TEMP%\StreetSet.arff -o \temp\clusters.arff -W "weka.clusterers.SimpleKMeans -N 10 -num-slots 4"
for /l %i in (1,1,10) do java -Xmx4096m -cp weka.jar weka.filters.unsupervised.instance.RemoveWithValues -C last -L %i -V -i \temp\clusters.arff -o \temp\cluster%i.arff
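To then train a separate classifier on each cluster file, which is the whole point of the split, a similar loop works. Note that AddCluster appends the cluster attribute as the last attribute, so the class attribute is no longer in the default (last) position; pass its real index with -c (the -c 3 here is just a placeholder for my data, as is the choice of J48):

for /l %i in (1,1,10) do java -Xmx4096m -cp weka.jar weka.classifiers.trees.J48 -no-cv -v -c 3 -t \temp\cluster%i.arff -d \temp\cluster%i.model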
This is not quite what I wanted: with this approach I can't use the Experimenter to try different parameter combinations, and it complicates the evaluation/testing of new instances, since they have to go through the separate command lines to be split by cluster first. I was hoping all of this would be handled transparently within a Weka meta-classifier.
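As far as I can tell there is no stock meta-classifier that does this, but if dropping into the Weka Java API is an option, it doesn't take much code to write one. Below is a rough, untested sketch of the idea; the class name, the hard-coded SimpleKMeans/J48 choices, and the fixed cluster count are my own placeholders, not anything that ships with Weka:

import weka.classifiers.AbstractClassifier;
import weka.classifiers.Classifier;
import weka.classifiers.trees.J48;
import weka.clusterers.SimpleKMeans;
import weka.core.Instance;
import weka.core.Instances;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;

// Sketch of a "cluster-then-classify" meta-classifier: cluster the training data
// (ignoring the class attribute), train one base classifier per cluster, and route
// each new instance to the classifier of its cluster.
public class ClusterSplitClassifier extends AbstractClassifier {

    private Remove removeClass;      // strips the class attribute so the clusterer accepts the data
    private SimpleKMeans clusterer;  // assigns instances to clusters
    private Classifier[] models;     // one base classifier per cluster
    private int numClusters = 10;    // fixed here for simplicity

    @Override
    public void buildClassifier(Instances data) throws Exception {
        // Remove the class attribute; Weka clusterers refuse data with a class set.
        removeClass = new Remove();
        removeClass.setAttributeIndices("" + (data.classIndex() + 1));
        removeClass.setInputFormat(data);
        Instances clusterData = Filter.useFilter(data, removeClass);

        clusterer = new SimpleKMeans();
        clusterer.setNumClusters(numClusters);
        clusterer.buildClusterer(clusterData);

        // Partition the original (class-bearing) instances by cluster membership.
        Instances[] buckets = new Instances[numClusters];
        for (int c = 0; c < numClusters; c++) {
            buckets[c] = new Instances(data, 0);
        }
        for (int i = 0; i < data.numInstances(); i++) {
            buckets[clusterer.clusterInstance(clusterData.instance(i))].add(data.instance(i));
        }

        // Train one base classifier per cluster (empty clusters are not handled in this sketch).
        models = new Classifier[numClusters];
        for (int c = 0; c < numClusters; c++) {
            models[c] = AbstractClassifier.makeCopy(new J48());
            models[c].buildClassifier(buckets[c]);
        }
    }

    @Override
    public double classifyInstance(Instance instance) throws Exception {
        // Strip the class attribute, find the instance's cluster, delegate to that model.
        removeClass.input(instance);
        Instance stripped = removeClass.output();
        return models[clusterer.clusterInstance(stripped)].classifyInstance(instance);
    }
}

An instance of a class like this could then be used like any other Weka classifier (for example, handed to Evaluation), which would make the cluster-based split transparent to the rest of the workflow.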