I am working on single-label text categorization with the Reuters-21578 dataset; however, the dataset is multi-label by default. Many researchers have removed the multi-label instances from this dataset, and their per-category instance counts are quite different from mine. How can I remove all instances that belong to more than one category? Can I use Weka or RapidMiner to identify the multi-label instances in a dataset?
Example:
Input dataset = {x1, x2, x3, x4, x5, x6, x7, x8, x9, x10}
Labels = {acq, earn, grain, corn}
Classification results:
x1, x2, x3 = acq
x4, x5 = earn
x6, x7, x8 = grain
x9 = grain, corn
x10 = grain, acq

Output dataset (what I want):
Output dataset = {x1, x2, x3, x4, x5, x6, x7, x8}
Output labels = {acq, earn, grain, corn}
Classification results:
x1, x2, x3 = acq
x4, x5 = earn
x6, x7, x8 = grain

**OR** (this is what I assume I have achieved with the PolynomiaByBinomial operator):
Output dataset = {x1, x2, x3, x4, x5, x6, x7, x8, x9, x10}
Output labels = {acq, earn, grain, corn}
Classification results:
x1, x2, x3 = acq
x4, x5 = earn
x6, x7, x8, x9, x10 = grain (that is, x9 = grain and x10 = grain)
Thanks in advance
The simplest way is to break the dataset into binary problems. For example, suppose you have the documents:
text1: science
text2: sports, politics
Break the dataset into 3 datasets, one per class:
dataset1 (science): text1:true, text2:false
dataset2 (sports): text1:false, text2:true
dataset3 (politics): text1:false, text2:true
Create 3 binary classifiers, one for each class, use the corresponding datasets to train them, and combine the results.
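The split described above can be sketched in a few lines of plain Python. This is only an illustration under assumptions: the `docs` dictionary and the helper name `to_binary_datasets` are hypothetical and not part of Weka, RapidMiner, or any Reuters loader. It builds one true/false dataset per class, which you could then feed to any binary classifier.

```python
# Hypothetical toy corpus: document id -> set of labels (multi-label).
docs = {
    "text1": {"science"},
    "text2": {"sports", "politics"},
}


def to_binary_datasets(labelled_docs):
    """Build one binary (true/false) dataset per label (one-vs-rest)."""
    # Collect every label that occurs anywhere in the corpus.
    all_labels = sorted(set().union(*labelled_docs.values()))
    # For each label, mark every document as True if it carries that label.
    return {
        label: {doc_id: label in labels for doc_id, labels in labelled_docs.items()}
        for label in all_labels
    }


binary = to_binary_datasets(docs)
for label, dataset in binary.items():
    print(label, dataset)
```

Each entry of `binary` corresponds to one of the three datasets listed above; a document with several labels is simply a positive example in more than one of them.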