weka, feature-selection

Weka exception "Train and test file not compatible!" thrown despite filters that should correct the incompatibility


Let's say I have the following data in ARFF format:

TRAIN:

@ATTRIBUTE A NUMERIC
@ATTRIBUTE B NUMERIC
@ATTRIBUTE C NUMERIC

TEST:

@ATTRIBUTE ID NUMERIC
@ATTRIBUTE A NUMERIC
@ATTRIBUTE B NUMERIC
@ATTRIBUTE C NUMERIC
@ATTRIBUTE D NUMERIC
@ATTRIBUTE E NUMERIC

To explain the difference in attributes: feature selection was performed on the TRAIN data, so some attributes were removed. I need predictions on the TEST dataset from a classifier trained on the TRAIN dataset, but the TRAIN and TEST headers do not match. I tried to solve this by applying RemoveByName filters with the excess feature names as parameters, but it still fails with the error "Train and test file not compatible!".

I was reading this correspondence, where it is stated that filters are also applied to the test data so that the two are compatible, but that does not appear to be the case here.

Do I have to create a separate new file externally for each subset of selected features in the TRAIN file, or can I use FilteredClassifier to remove the features that are not needed? Or can I somehow specify which attributes to use for prediction?

EDIT1:

I need to run everything from the command line, and I need to be able to supply variable parameters and variable filters for both the base classifier and the FilteredClassifier. As @zbicyclist suggested, I tried to make it work through the InputMappedClassifier with the following command:

java -Xmx4096m -cp data/java/weka.jar weka.classifiers.misc.InputMappedClassifier -t train.arff -T test_bin.arff -classifications weka.classifiers.evaluation.output.prediction.CSV -p first -file FILE.arff -suppress -S 1 -W weka.classifiers.meta.FilteredClassifier -- -F weka.filters.MultiFilter -F "weka.filters.unsupervised.attribute.RemoveByName -E ^ID$" -F "weka.filters.unsupervised.attribute.RemoveByName -E ^OD_VALUE$" -W weka.classifiers.rules.DecisionTable -- -I

With newlines added for readability (they must be removed before running it), it looks like this:

java -Xmx4096m -cp data/java/weka.jar 
weka.classifiers.misc.InputMappedClassifier
  -t train.arff
  -T test_bin.arff
  -classifications weka.classifiers.evaluation.output.prediction.CSV
  -p first
  -file FILE.arff
  -suppress
  -S 1
  -W weka.classifiers.meta.FilteredClassifier
--
  -F weka.filters.MultiFilter
  -F "weka.filters.unsupervised.attribute.RemoveByName -E ^ID$"
  -F "weka.filters.unsupervised.attribute.RemoveByName -E ^OD_VALUE$"
  -W weka.classifiers.rules.DecisionTable
--
  -I

It does not work, though, and fails with: Weka exception: Illegal options: -F weka.filters.unsupervised.attribute.RemoveByName -E ^ID$ -F weka.filters.unsupervised.attribute.RemoveByName -E ^OD_VALUE$

Can anyone help me nest the command properly, so that I can wrap the base classifier in a FilteredClassifier and then wrap the FilteredClassifier in an InputMappedClassifier?


Solution

  • The problem is that the inputs are probably compared before the filtering is applied. You therefore need to wrap everything in InputMappedClassifier and filter out the unnecessary columns only after the train input features have been mapped to the corresponding test input features.

    I managed to come up with the following command:

    java -Xmx4096m -cp data/java/weka.jar weka.classifiers.misc.InputMappedClassifier \
    -t train.arff \
    -T test_bin.arff \
    -classifications \
        "weka.classifiers.evaluation.output.prediction.CSV \
        -p first \
        -file FILE.arff \
        -suppress" \
    -W weka.classifiers.meta.FilteredClassifier \
    -- \
        -F "weka.filters.MultiFilter -F \"weka.filters.unsupervised.attribute.RemoveByName -E ^ID$\" -F \"weka.filters.unsupervised.attribute.RemoveByName -E ^OD_VALUE$\"" \
        -S 1 \
        -W weka.classifiers.rules.DecisionTable \
        -- \
            -I
    

    Which seems to do what I need.
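
To see what changed compared with the failing attempt, it can help to print the arguments exactly as the shell hands them over. The sketch below does not call Weka at all; `show_args` is a hypothetical helper introduced only for illustration. It suggests a likely reason for the "Illegal options" error: in the first form the two RemoveByName filters arrive as extra `-F` options of FilteredClassifier, whereas in the second form the whole MultiFilter specification arrives as the value of a single `-F`. That FilteredClassifier reads only one `-F` and rejects the rest is an assumption, although it is consistent with the error message in the question.

```shell
#!/bin/sh
# Hypothetical helper: print each argument on its own line, wrapped in
# brackets, so that the argument boundaries become visible.
show_args() {
    for arg in "$@"; do
        printf '[%s]\n' "$arg"
    done
}

# Form from the failing command: three separate -F options.
show_args \
    -F weka.filters.MultiFilter \
    -F "weka.filters.unsupervised.attribute.RemoveByName -E ^ID$" \
    -F "weka.filters.unsupervised.attribute.RemoveByName -E ^OD_VALUE$"

echo '---'

# Form from the working command: one -F whose value is the complete
# MultiFilter specification, with the inner filters in escaped quotes.
show_args \
    -F "weka.filters.MultiFilter -F \"weka.filters.unsupervised.attribute.RemoveByName -E ^ID$\" -F \"weka.filters.unsupervised.attribute.RemoveByName -E ^OD_VALUE$\""
```

The first call prints six bracketed lines (three `-F` flags with three values), the second prints only two: the flag and one long value that still contains the inner double quotes.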

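    Classifiers can be nested by giving the -W <classifier.name> argument last and putting the nested classifier's options after the -- argument; this part needs no obscure quote backslashing. Only a filter specification that carries its own options, such as the MultiFilter in the command above, has to be passed as a single quoted -F value.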
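
Because the filters have to change with every selected feature subset, the list of excess attributes could be derived from the two headers instead of being typed by hand. The following is a minimal sketch, not part of the original answer. The function names are hypothetical, attribute names are assumed to be unquoted, free of spaces and free of regex metacharacters, and the example files reuse the headers from the question with an `@RELATION` line added. It emits one RemoveByName specification with an alternation in place of several filters, which is an assumption about what suits the setup and would still have to be placed inside the quoted `-F` value.

```shell
#!/bin/sh
# Print the attribute names that are declared in the second ARFF file
# but not in the first one, in the order they appear in the second file.
extra_attributes() {
    awk 'toupper($1) == "@ATTRIBUTE" {
             if (FNR == NR) seen[$2] = 1        # first file: remember the name
             else if (!($2 in seen)) print $2   # second file: report unknown names
         }' "$1" "$2"
}

# Build a single RemoveByName specification matching all excess names.
# Returns non-zero when there is nothing to remove.
build_remove_filter() {
    names=$(extra_attributes "$1" "$2" | paste -sd '|' -)
    [ -n "$names" ] || return 1
    printf 'weka.filters.unsupervised.attribute.RemoveByName -E ^(%s)$\n' "$names"
}

# Demo with the headers from the question (file names are illustrative).
workdir=$(mktemp -d)
cat > "$workdir/train.arff" <<'EOF'
@RELATION train
@ATTRIBUTE A NUMERIC
@ATTRIBUTE B NUMERIC
@ATTRIBUTE C NUMERIC
EOF
cat > "$workdir/test.arff" <<'EOF'
@RELATION test
@ATTRIBUTE ID NUMERIC
@ATTRIBUTE A NUMERIC
@ATTRIBUTE B NUMERIC
@ATTRIBUTE C NUMERIC
@ATTRIBUTE D NUMERIC
@ATTRIBUTE E NUMERIC
EOF

build_remove_filter "$workdir/train.arff" "$workdir/test.arff"
rm -r "$workdir"
```

For the two example headers this prints a specification whose expression is `^(ID|D|E)$`.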