Modifying columns of weka.core.dataset.Instances object


I'm using python-weka-wrapper3 and have just loaded an ARFF dataset:

kc1_class_arff = arff_loader(DATA_PATH, "/kc1_class.arff")

The last column of this dataset is named NUMDEFECTS and contains floats.

I need this column to be renamed to DEFECTS and converted to integers:

  • 1 if NUMDEFECTS != 0
  • 0 if NUMDEFECTS = 0

The arff_loader function is defined as follows:

def arff_loader(DATA_PATH, file_name):
    data = loader.load_file(DATA_PATH + file_name)
    data.class_is_last()
    return data

Solution

  • You can achieve this by creating a short filter pipeline:

    • For renaming the attribute, you can use the RenameAttribute filter.

    • For turning the numeric attribute into an indicator attribute, you can use the MathExpression filter.

    • For combining multiple filters, use MultiFilter.
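The indicator step maps every non-zero value to 1 and zero to 0. As a plain-Python sketch of that mapping (no Weka involved; the helper name `to_indicator` and the sample values are made up for illustration), written in the same nested if/else shape that a MathExpression expression would take:

```python
def to_indicator(value):
    """Return 1 for any non-zero value and 0 for zero."""
    if value < 0:
        return 1
    return 1 if value > 0 else 0


# Hypothetical NUMDEFECTS values, including zero and a negative number.
num_defects = [12.0, 25.1, 5.0, 0.0, -10.0]
defects = [to_indicator(v) for v in num_defects]
print(defects)  # [1, 1, 1, 0, 1]
```

Checking both `< 0` and `> 0` is just a way of spelling "not equal to zero" using only comparison operators.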

    Assuming an input file like:

    @relation defects
    
    @attribute x1 numeric
    @attribute x2 numeric
    @attribute NUMDEFECTS numeric
    
    @data
    0,1,12.0
    1,1,25.1
    0,0,5.0
    1,0,0.0
    -1,0,-10.0
    

    You can apply this python-weka-wrapper3 code:

    import weka.core.jvm as jvm
    from weka.core.converters import load_any_file
    from weka.filters import Filter, MultiFilter
    
    jvm.start()
    
    # load data
    data = load_any_file("./defects.arff")
    
    # rename
    rename = Filter(classname="weka.filters.unsupervised.attribute.RenameAttribute",
                    options=["-find", "NUMDEFECTS", "-replace", "DEFECTS"])
    
    # float -> indicator
    # NB: the class attribute must be unset for this filter to work!
    indicator = Filter(classname="weka.filters.unsupervised.attribute.MathExpression",
                       options=["-E", "ifelse(A < 0, 1, ifelse(A > 0, 1, 0))", "-R", "last", "-V"])
    
    multi = MultiFilter()
    multi.filters = [rename, indicator]
    multi.inputformat(data)
    filtered = multi.filter(data)
    filtered.relationname = data.relationname
    print(filtered)
    
    jvm.stop()
    

    And you will get something like this:

    @relation defects
    
    @attribute x1 numeric
    @attribute x2 numeric
    @attribute DEFECTS numeric
    
    @data
    0,1,1
    1,1,1
    0,0,1
    1,0,0
    -1,0,1
    

    Once you have obtained the filtered data, you can set the class attribute (for example with class_is_last()) and return the result from your arff_loader function.
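A minimal sketch of how arff_loader could wrap the pipeline above, assuming the JVM has already been started with jvm.start() and reusing the filter options from the answer unchanged. The lower-case parameter names and the module-level option constants are my own naming, and the imports sit inside the function only so that the definition can be loaded before Weka is needed; moving them to the top of the file works just as well.

```python
# Filter options copied unchanged from the pipeline above.
RENAME_OPTIONS = ["-find", "NUMDEFECTS", "-replace", "DEFECTS"]
INDICATOR_OPTIONS = ["-E", "ifelse(A < 0, 1, ifelse(A > 0, 1, 0))",
                     "-R", "last", "-V"]


def arff_loader(data_path, file_name):
    """Load an ARFF file, rename NUMDEFECTS to DEFECTS and turn it into 0/1.

    Assumes jvm.start() has already been called by the caller.
    """
    from weka.core.converters import load_any_file
    from weka.filters import Filter, MultiFilter

    # Load without setting a class attribute, so MathExpression
    # is allowed to modify the last column.
    data = load_any_file(data_path + file_name)

    rename = Filter(
        classname="weka.filters.unsupervised.attribute.RenameAttribute",
        options=RENAME_OPTIONS)
    indicator = Filter(
        classname="weka.filters.unsupervised.attribute.MathExpression",
        options=INDICATOR_OPTIONS)

    # Chain both filters and apply them in one pass.
    multi = MultiFilter()
    multi.filters = [rename, indicator]
    multi.inputformat(data)
    filtered = multi.filter(data)
    filtered.relationname = data.relationname

    # Only now mark the last attribute (DEFECTS) as the class.
    filtered.class_is_last()
    return filtered
```

The call from the question, `arff_loader(DATA_PATH, "/kc1_class.arff")`, would then stay the same. The key design point is the ordering: the class attribute is set after filtering rather than before it.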