java regex weka

Remove Attributes by Name. Filter broken?


Weka has an attribute filter, RemoveByName, that should remove every attribute whose name matches a specified regular expression from a set of Instances.

I have problems with the RegEx.

I tried several simple expressions, all of which are valid (tested on regexr), but the filter does not seem to accept them.

Here is the relevant code:

Instances dataset1_x=new Instances(dataset1);

RemoveByName filterX=new RemoveByName();
filterX.setInputFormat(dataset1_x);
filterX.setInvertSelection(true);
filterX.setExpression(Pattern.quote("^.*i$"));
//filterX.setExpression("^.*i$"); does not work either
Instances dataset1_=Filter.useFilter(dataset1_x,filterX);

This should match all names ending with an "i".

The resulting dataset is named

"dataset-weka.filters.unsupervised.attribute.StringToNominal-Rlast-weka.filters.unsupervised.attribute.Remove-weka.filters.unsupervised.attribute.RemoveByName-E^.*id$"

Note that ^.*id$ is the default expression. It has not changed.

This happens even though filterX.getExpression() returns the regex I set before. This usage of the filter also matches several code examples I found. The result is the same if I set the regex using setOptions(). The problem occurs in both version 3.9.0 (dev) and 3.8 (stable).

In the WEKA GUI, the filter works correctly.

So my other guess is that the regex needs a special format when it is set programmatically. Unfortunately, the API documentation does not provide examples.
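
One thing that can be checked independently of Weka is what the two setExpression arguments in the code above actually match. The sketch below uses only java.util.regex and String.matches. The class name, the helper methods and the sample attribute names are made up for illustration, and it assumes the filter compares whole attribute names against the expression. It shows that Pattern.quote wraps the text in \Q...\E, so the quoted version matches only an attribute literally named ^.*i$, while the plain version matches names ending in "i".

```java
import java.util.regex.Pattern;

public class QuoteVersusPlainRegex {
    // Plain expression: any name whose last character is 'i'.
    static final String PLAIN = "^.*i$";
    // Pattern.quote wraps the text in \Q...\E, so every character is taken literally.
    static final String QUOTED = Pattern.quote(PLAIN);

    static boolean matchesPlain(String name) {
        return name.matches(PLAIN);
    }

    static boolean matchesQuoted(String name) {
        return name.matches(QUOTED);
    }

    public static void main(String[] args) {
        // Hypothetical attribute names, chosen only to exercise both expressions.
        String[] names = {"xi", "phi", "id", "^.*i$"};
        for (String name : names) {
            System.out.println(name
                    + "  plain=" + matchesPlain(name)
                    + "  quoted=" + matchesQuoted(name));
        }
    }
}
```

If the goal is to select names ending in "i", the unquoted string is the one to pass. Pattern.quote is only appropriate when the attribute name itself should be matched character for character.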


Solution

  • You need to set the expression and the invertSelection flag before setting the input format.

    More generally, I assume that you have to set all options before calling setInputFormat.

    The following works:

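    Instances dataset1_x=new Instances(dataset1);
    RemoveByName filterX=new RemoveByName();
    // Configure all options first ...
    filterX.setInvertSelection(true);
    // Plain regex: Pattern.quote would turn it into a literal string match
    filterX.setExpression("^.*i$");
    // ... and only then set the input format
    filterX.setInputFormat(dataset1_x);
    Instances dataset1_=Filter.useFilter(dataset1_x,filterX);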
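
To illustrate why the order might matter, here is a toy model in plain Java. It is not Weka code: ToyNameFilter and both helper methods are hypothetical, and the sketch simply assumes that the filter decides which attributes to keep at the moment setInputFormat is called, using whatever options are set at that point. Under that assumption, options changed afterwards are stored (so a getter would return them) but never applied, which would be consistent with the default ^.*id$ showing up in the dataset name.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class OptionOrderSketch {

    /** Toy stand-in for a name-based attribute filter; NOT the Weka class. */
    static class ToyNameFilter {
        private String expression = "^.*id$"; // default taken from the question
        private boolean invert = false;
        private List<String> kept = new ArrayList<>();

        void setExpression(String expression) { this.expression = expression; }
        void setInvertSelection(boolean invert) { this.invert = invert; }
        String getExpression() { return expression; }

        // Assumption: the selection is computed here, from the current options.
        void setInputFormat(List<String> attributeNames) {
            kept = new ArrayList<>();
            for (String name : attributeNames) {
                boolean match = name.matches(expression);
                // Not inverted: drop matching names. Inverted: keep only matching names.
                if (match == invert) {
                    kept.add(name);
                }
            }
        }

        List<String> output() { return kept; }
    }

    // Order used in the question: input format first, options afterwards.
    static List<String> optionsAfterFormat(List<String> names) {
        ToyNameFilter f = new ToyNameFilter();
        f.setInputFormat(names);
        f.setInvertSelection(true);
        f.setExpression("^.*i$");
        return f.output();
    }

    // Order used in the answer: options first, input format last.
    static List<String> optionsBeforeFormat(List<String> names) {
        ToyNameFilter f = new ToyNameFilter();
        f.setInvertSelection(true);
        f.setExpression("^.*i$");
        f.setInputFormat(names);
        return f.output();
    }

    public static void main(String[] args) {
        // Hypothetical attribute names.
        List<String> names = Arrays.asList("user_id", "xi", "phi", "label");
        System.out.println("options after format:  " + optionsAfterFormat(names));
        System.out.println("options before format: " + optionsBeforeFormat(names));
    }
}
```

In the toy model, the first ordering silently applies the default expression, while the second applies the expression that was actually configured. Whether the real filter behaves exactly like this is an assumption, but it matches the symptoms described in the question.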