machine-learning, weka, data-mining, rapidminer, text-classification

Converting Multilabel dataset into Single Label?


I am working on single-label text categorization with the Reuters-21578 dataset; however, the dataset is multi-label by default. Many researchers removed the multi-label instances from this dataset, and their number of instances per Reuters category is quite different from mine. How can I remove all the instances that belong to more than one category in a dataset? Can I use Weka or RapidMiner to identify the multi-label instances in a dataset?

Example:


    Input Dataset = {x1, x2, x3, x4, x5, x6, x7, x8, x9, x10}
    Labels = {acq, earn, grain , corn}


    Classification Results = 

    x1, x2, x3 = acq
    x4, x5 = earn
    x6, x7, x8 = grain
    x9 = grain, corn
    x10 = grain, acq

    Output Dataset (what i want) = 
    output dataset = {x1, x2, x3, x4, x5, x6, x7, x8}
    output labels = {acq, earn, grain, corn}

    Classification Results = 

    x1, x2, x3 = acq
    x4, x5 = earn
    x6, x7, x8 = grain

    **OR**
    {This is what I assume I have achieved with the PolynomiaByBinomial operator}
    output dataset = {x1, x2, x3, x4, x5, x6, x7, x8, x9, x10}
    output labels = {acq, earn, grain, corn}
    Classification Results = 

    x1, x2, x3 = acq
    x4, x5 = earn
    x6, x7, x8, x9, x10 = grain
    x9 = grain
    x10 = grain

Thanks in advance


Solution

  • The simplest way is to break the dataset into binary problems, one per class. If, for example, you have the dataset

    text1: science
    text2: sports, politics
    

    Break the dataset into 3 datasets:

    dataset1 (science): text1:true, text2:false
    dataset2 (sports): text1:false, text2:true
    dataset3 (politics): text1:false, text2:true
    

    Create 3 binary classifiers, one for each class, use the corresponding datasets to train them, and combine the results.
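
    The split described above can be sketched in a few lines of plain Python. This is only an illustration, not a Weka or RapidMiner recipe: the `dataset` dictionary and the helper names `to_binary_datasets` and `keep_single_label` are made up for this example. The second helper shows the filtering the question asks about (dropping every instance with more than one label), assuming the labels of each document are already available as a list.

```python
from typing import Dict, List

# Hypothetical toy data mirroring the example above: document id -> labels.
dataset: Dict[str, List[str]] = {
    "text1": ["science"],
    "text2": ["sports", "politics"],
}


def to_binary_datasets(data: Dict[str, List[str]]) -> Dict[str, Dict[str, bool]]:
    """Build one true/false dataset per label (a one-vs-rest split)."""
    labels = sorted({label for doc_labels in data.values() for label in doc_labels})
    return {
        label: {doc: label in doc_labels for doc, doc_labels in data.items()}
        for label in labels
    }


def keep_single_label(data: Dict[str, List[str]]) -> Dict[str, str]:
    """Keep only the instances that carry exactly one label."""
    return {
        doc: doc_labels[0]
        for doc, doc_labels in data.items()
        if len(doc_labels) == 1
    }


binary = to_binary_datasets(dataset)   # one binary dataset per class
single = keep_single_label(dataset)    # multi-label instances removed

for label, rows in binary.items():
    print(label, rows)
print("single-label only:", single)
```

    Each entry of `binary` can then be used to train one binary classifier, while `single` is the reduced dataset for a plain single-label setup.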