
How to read Sparse ARFF data using Python libraries?


The data section looks something like this: {60 1,248 1,279 1,316 1}. When I use the Python liac-arff library, I get an error like this: ValueError: {60 1 value not in ('0', '1').

When I use a regular (dense) ARFF file, it works fine.

I am using the well-known delicious.arff dataset from the MULAN site.

Is there any other method I need to use? Can anyone help?


Solution

  • You can use the function scikit-multilearn provides for loading ARFF data.

    Example of how to use it: the first argument is the path to the ARFF file. The file is in MULAN format, so the labels are at the end (label_location="end"). The delicious data set has 983 labels. Its features are integers, and the input data is already nominal, because the input space in delicious is a bag of words. Remember to always read what a data set contains in the relevant paper (the MULAN site lists the source paper for each data set):

    from skmultilearn.dataset import load_from_arff
    
    X, y = load_from_arff("/home/user/data/delicious-train.arff",
        # number of labels (the keyword is label_count in current scikit-multilearn releases)
        label_count=983,
        # MULAN format: labels are at the end of each row, so label_location is 'end';
        # 'start' is also available for the MEKA format
        label_location='end',
        # bag of words: integer features, no nominal encoding needed
        input_feature_type='int', encode_nominal=False,
        # liac-arff's sparse loader sometimes fails, as it does on delicious;
        # scikit-multilearn converts the loaded data to sparse matrices anyway,
        # so the sparse loader is disabled here.
        # You may set load_sparse to True if this fails.
        load_sparse=False,
        # whether to also return the attribute definitions; usually
        # you don't need this
        return_attribute_definitions=False)
    

    What is returned?

    >>> print(X, y)
    (<12920x500 sparse matrix of type '<type 'numpy.int64'>' with 6460000 stored elements in LInked List format>,
    <12920x983 sparse matrix of type '<type 'numpy.int64'>' with 12700360 stored elements in LInked List format>)
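
    If loading still fails, it can help to check by hand what a sparse row encodes. In sparse ARFF, each data line lists "index value" pairs inside braces, indices are 0-based, and any attribute not listed is 0. The following is only a rough sketch using the standard library: the helpers parse_sparse_row and split_features_labels are hypothetical names, not part of liac-arff or scikit-multilearn, and the sketch assumes plain integer values with no quoted strings and labels stored in the last columns (MULAN layout).

```python
def parse_sparse_row(line, n_attributes):
    """Expand one sparse ARFF data line into a dense list of ints.

    Hypothetical helper: assumes integer values and no quoted strings.
    """
    body = line.strip()
    if not (body.startswith("{") and body.endswith("}")):
        raise ValueError("not a sparse ARFF row: %r" % line)
    dense = [0] * n_attributes          # unlisted attributes default to 0
    body = body[1:-1].strip()
    if not body:                        # "{}" means an all-zero row
        return dense
    for pair in body.split(","):
        index, value = pair.split()     # each pair is "<index> <value>"
        dense[int(index)] = int(value)
    return dense


def split_features_labels(row, label_count):
    """Split a dense row, assuming the labels are the last label_count columns."""
    cut = len(row) - label_count
    return row[:cut], row[cut:]


# Toy row with 8 attributes: 5 features followed by 3 labels.
row = parse_sparse_row("{1 1,4 1,6 1}", n_attributes=8)
features, labels = split_features_labels(row, label_count=3)
print(features)
print(labels)
```

    For a real file you would take n_attributes from the number of @attribute lines in the header, and for delicious pass label_count=983.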