I am new to Weka and have found little documentation on its filters.
I have two attributes, profit and cost. I would like to create a new attribute named result that compares profit and cost for each row: the label should be gain if profit > cost, and loss otherwise.
I am using the Weka Explorer UI. I tried the Copy and MergeTwoValues filters, but neither seems able to do the comparison step. What would be the right approach?
Such a comparison is possible using the MathExpression filter and its ifelse construct as part of a filter pipeline. However, MathExpression cannot produce labels, so the result is a numeric indicator instead: 1 for gain, 0 for loss.
MultiFilter
|
+- Add (we insert a new numeric attribute, all missing values)
|
+- ReplaceMissingWithUserConstant (MathExpression skips missing values, hence replacing them in our new attribute)
|
+- MathExpression (the actual comparison between the two attributes)
|
+- NumericToNominal (to turn the numeric 0/1 values into labels)
I will demonstrate how to construct this pipeline using the bolts UCI dataset, which has the following attributes:
1 RUN numeric
2 SPEED1 numeric
3 TOTAL numeric
4 SPEED2 numeric
5 NUMBER2 numeric
6 SENS numeric
7 TIME numeric
8 T20BOLT numeric
For this example, I want to compare SENS and TIME, creating an indicator of whether SENS > TIME.
MultiFilter
The MultiFilter instance combines all our sub-filters into a single filter setup. That way you can easily apply it, extend it, or use it within a FilteredClassifier setup.
Add
First, we add a new attribute at index 8 using the Add filter, which pushes the class attribute to position 9. We name it SENS>TIME (you can give it any name you want):
weka.filters.unsupervised.attribute.Add -N SENS>TIME -C 8
ReplaceMissingWithUserConstant
Next, we use the ReplaceMissingWithUserConstant filter to replace the missing values in our new attribute (index 8) with a dummy value, e.g., -1. This is unfortunately necessary, since MathExpression does not operate on missing values.
weka.filters.unsupervised.attribute.ReplaceMissingWithUserConstant -A 8 -R -1 -F "yyyy-MM-dd\'T\'HH:mm:ss"
MathExpression
With the stage set, we can now use MathExpression to fill in our comparison using the expression ifelse(A6>A7,1,0):
weka.filters.unsupervised.attribute.MathExpression -E ifelse(A6>A7,1,0) -V -R 8
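If attribute 6 (SENS) is greater than attribute 7 (TIME), the filter inserts a 1, otherwise a 0. The -V -R 8 options invert the ignore range, so the expression is applied only to attribute 8.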
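To make the semantics concrete, here is a minimal plain-Java sketch of what the expression computes for each row. It deliberately does not use the Weka API: the class and method names (IndicatorSketch, indicator) are hypothetical, and the sample values are made up for illustration.

```java
public class IndicatorSketch {
    // Mirrors ifelse(A6>A7,1,0): 1 if the first value is strictly greater, else 0.
    public static double indicator(double a6, double a7) {
        return a6 > a7 ? 1.0 : 0.0;
    }

    public static void main(String[] args) {
        // Made-up (SENS, TIME) pairs, not taken from the bolts dataset.
        double[][] rows = { {10, 5.7}, {6, 17.56}, {6, 6} };
        for (double[] r : rows) {
            System.out.println(r[0] + " > " + r[1] + " -> " + indicator(r[0], r[1]));
        }
    }
}
```

Note that equal values yield 0, because the comparison is strict.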
NumericToNominal
With the NumericToNominal filter we will turn the numeric indicators in our comparison attribute into nominal labels:
weka.filters.unsupervised.attribute.NumericToNominal -R 8
Bonus
If you want to use the labels gain/loss instead of 1/0, you can add the RenameNominalValues filter at the end of the pipeline.
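Conceptually, the renaming step is just a lookup from old labels to new ones. The following plain-Java sketch illustrates that idea only; it is not how RenameNominalValues is implemented, and the names RenameLabelsSketch, RENAMES and rename are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class RenameLabelsSketch {
    // Assumed mapping: 1 means SENS > TIME (gain), 0 means otherwise (loss).
    static final Map<String, String> RENAMES = Map.of("1", "gain", "0", "loss");

    // Replaces each known label with its new name; unknown labels pass through unchanged.
    public static List<String> rename(List<String> labels) {
        List<String> out = new ArrayList<>();
        for (String label : labels) {
            out.add(RENAMES.getOrDefault(label, label));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(rename(List.of("1", "0", "1"))); // prints [gain, loss, gain]
    }
}
```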