Tags: java, weka, random-forest, libsvm, text-classification

StringToWordVector error in Java for text classification


1- I am trying to apply the StringToWordVector filter to text from Java code, but the output of the filter is not what I expect. This is the code I used:

import java.io.File;
import weka.core.Instances;
import weka.core.tokenizers.NGramTokenizer;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.StringToWordVector;

// "source" is created elsewhere (not shown); presumably a Weka DataSource or loader
Instances instances = source.getDataSet();
// The class is the last attribute of the input data
instances.setClassIndex(instances.numAttributes() - 1);

StringToWordVector stwv = new StringToWordVector();
// Tokenizer that splits a string into n-grams; min = max = 1 gives single words
NGramTokenizer tokenizer = new NGramTokenizer();
tokenizer.setNGramMinSize(1);
tokenizer.setNGramMaxSize(1);
tokenizer.setDelimiters(" \r\n\t.,;:'\"()?!'");
stwv.setTokenizer(tokenizer);

stwv.setDoNotOperateOnPerClassBasis(true);
stwv.setOutputWordCounts(true);
stwv.setDictionaryFileToSaveTo(new File("/forEclips/RandomForset/DictionaryFile.txt"));
// Set the input format only after all options have been set
stwv.setInputFormat(instances);
// Apply the filter
Instances dataFiltered = Filter.useFilter(instances, stwv);
System.out.println("\n\nFiltered data:\n\n" + dataFiltered.toString());

The output looks like:

@relation 'DIMS-weka.filters.unsupervised.attribute.StringToWordVector-R1-W10-prune-rate-1.0-C-N0-stemmerweka.core.stemmers.NullStemmer-stopwords-handlerweka.core.stopwords.Null-M1-O-tokenizerweka.core.tokenizers.NGramTokenizer -max 1 -min 1 -delimiters \" \\r\\n\\t.,;:\\\'\\\"()?!\\\'\"-dictionary/forEclips/RandomForset/DictionaryFile.txt 
@attribute class {Di,MS}
@attribute اشبو numeric
@attribute اللي numeric
@attribute المويه numeric
@attribute النار numeric
@attribute تشوفوا numeric
@attribute تعرفون numeric
@attribute حبايبي numeric
@attribute حجازي numeric
@attribute خلال numeric
@attribute دي numeric
@attribute زي numeric
@attribute سيدي numeric
@attribute صور numeric
@attribute في numeric
@attribute كتير numeric
@attribute كتييير numeric
@attribute كتيييير numeric
@attribute كده numeric
@attribute مثل numeric
@attribute من numeric
@attribute مو numeric
@attribute هل numeric
@attribute وعيشوا numeric
@attribute وقدود، numeric
@attribute يا numeric
@attribute يده numeric

@data
{0 MS,9 1,13 3,20 2}
{0 MS,9 3,13 1,20 2}
{0 MS,6 1,22 1}
{5 1,16 1,17 1,23 1,24 1}
{2 2,3 1,4 1,8 1,11 1,14 2,19 1,21 1,26 2}
{1 1,7 1,10 1,12 1,15 1,18 1,20 1,25 1}

As you can see, the class attribute is not placed at the end of the @attribute section. In addition, in the @data section the first three instances list the class first, while the last three show no class value or class index at all. The class, with its index, should come last in every instance.

2- I also want to add a numeric attribute (newattribute) to every instance, with the same value (44) for all of them.
The @attribute section would then look like this:

   @attribute اشبو numeric
   @attribute اللي numeric
   @attribute المويه numeric
   @attribute النار numeric
   @attribute تشوفوا numeric
   @attribute تعرفون numeric
   @attribute حبايبي numeric
   @attribute حجازي numeric
   @attribute خلال numeric
   @attribute دي numeric
   @attribute زي numeric
   @attribute سيدي numeric
   @attribute صور numeric
   @attribute في numeric
   @attribute كتير numeric
   @attribute كتييير numeric
   @attribute كتيييير numeric
   @attribute كده numeric
   @attribute مثل numeric
   @attribute من numeric
   @attribute مو numeric
   @attribute هل numeric
   @attribute وعيشوا numeric
   @attribute وقدود، numeric
   @attribute يا numeric
   @attribute يده numeric
   @attribute newattribute numeric
   @attribute class {Di,MS}


   @data
   {8 1,12 3,19 2,26 44,27 MS}
   {8 3,12 1,19 2,26 44,27 MS}
   {5 1,21 1,26 44,27 MS}
   {4 1,15 1,16 1,22 1,23 1,26 44,27 Di}
   {1 2,2 1,3 1,7 1,10 1,13 2,18 1,20 1,25 2,26 44,27 Di}
   {0 1,6 1,9 1,11 1,14 1,17 1,19 1,24 1,26 44,27 Di}

3- I want to use this training data to classify text with Naive Bayes, Random Forest, and SVM. How can I set up cross-validation for training and testing with the Weka library in Java? I tried to use SVM by adding LibSVM to the Java build path, but it gives me an error.

Regards;


Solution
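
On points 1 and 2: as far as I can tell, the output in the question is expected behaviour rather than a filter error. StringToWordVector places the attributes it does not convert (here the class) before the generated word attributes, and sparse ARFF leaves out every value whose internal index is 0. For a nominal attribute the first declared label is stored as 0, so with {Di,MS} the three rows without a class entry are simply Di instances. Inside Weka, the Reorder and Add filters in weka.filters.unsupervised.attribute are the usual tools for moving the class to the end and adding a constant attribute. The sketch below does not use Weka at all; it only illustrates the index arithmetic on the sparse rows from the question. The class name SparseRowRewriter and its method are hypothetical, and the parser assumes simple unquoted values without embedded commas.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: rewrite sparse ARFF rows so the class moves from index 0 to the end. */
final class SparseRowRewriter {

    /**
     * Rewrites one sparse row such as "{0 MS,9 1}".
     *
     * @param row               sparse row with the class at index 0
     * @param oldAttributeCount attribute count of the filtered data (class + words)
     * @param classLabels       nominal labels in declaration order; the first one
     *                          is the value that sparse ARFF leaves out
     * @param constant          value of the extra numeric attribute
     */
    static String rewrite(String row, int oldAttributeCount,
                          String[] classLabels, int constant) {
        String body = row.trim();
        body = body.substring(1, body.length() - 1).trim(); // strip '{' and '}'

        String classValue = classLabels[0]; // implied when index 0 is absent
        List<String> entries = new ArrayList<>();
        if (!body.isEmpty()) {
            for (String entry : body.split(",")) {
                String[] parts = entry.trim().split("\\s+", 2);
                int index = Integer.parseInt(parts[0]);
                if (index == 0) {
                    classValue = parts[1]; // explicit class value
                } else {
                    // word attributes shift left by one once the class leaves index 0
                    entries.add((index - 1) + " " + parts[1]);
                }
            }
        }
        int newAttributeIndex = oldAttributeCount - 1; // first index after the words
        entries.add(newAttributeIndex + " " + constant);
        entries.add((newAttributeIndex + 1) + " " + classValue);
        return "{" + String.join(",", entries) + "}";
    }

    public static void main(String[] args) {
        String[] labels = {"Di", "MS"};
        // 27 = class + 26 word attributes in the output shown in the question
        System.out.println(rewrite("{0 MS,9 1,13 3,20 2}", 27, labels, 44));
        System.out.println(rewrite("{5 1,16 1,17 1,23 1,24 1}", 27, labels, 44));
    }
}
```

For the first and fourth rows of the question this yields {8 1,12 3,19 2,26 44,27 MS} and {4 1,15 1,16 1,22 1,23 1,26 44,27 Di}, which matches the layout asked for in point 2. Note that Weka itself would still omit a Di class value when printing sparse data, so "27 Di" is written out here only to mirror the requested layout.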

  • I found these websites very useful for text classification with the StringToWordVector filter: http://www.uky.edu/~nyu222/tutorials/Weka.htm https://www.youtube.com/watch?v=Tggs3Bd3ojQ&list=PLm4W7_iX_v4OMSgc8xowC2h70s-unJKCp&index=11
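
On point 3: Weka ships cross-validation in weka.classifiers.Evaluation (crossValidateModel), and wrapping the filter and the classifier in a FilteredClassifier is the usual way to keep the word dictionary from seeing the test fold. The LibSVM error cannot be diagnosed from the question because the message is not given. The sketch below is not Weka's implementation; it is a plain-Java illustration of what k-fold cross-validation does with the instance indices. The class name KFoldSplit and its methods are hypothetical, and the split is not stratified by class.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** Sketch of k-fold cross-validation index splitting (not Weka's own code). */
final class KFoldSplit {

    /** Returns k test folds; each index 0..n-1 lands in exactly one fold. */
    static List<List<Integer>> testFolds(int n, int k, long seed) {
        if (k < 2 || k > n) {
            throw new IllegalArgumentException("need 2 <= k <= n");
        }
        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            indices.add(i);
        }
        // Shuffle with a fixed seed so the split is random but repeatable
        Collections.shuffle(indices, new Random(seed));

        List<List<Integer>> folds = new ArrayList<>();
        for (int f = 0; f < k; f++) {
            folds.add(new ArrayList<>());
        }
        // Deal the shuffled indices round-robin so fold sizes differ by at most one
        for (int pos = 0; pos < n; pos++) {
            folds.get(pos % k).add(indices.get(pos));
        }
        return folds;
    }

    /** Training indices for one round: everything outside that round's test fold. */
    static List<Integer> trainIndices(List<List<Integer>> folds, int testFold) {
        List<Integer> train = new ArrayList<>();
        for (int f = 0; f < folds.size(); f++) {
            if (f != testFold) {
                train.addAll(folds.get(f));
            }
        }
        return train;
    }

    public static void main(String[] args) {
        // Six instances, as in the question, split into three folds
        List<List<Integer>> folds = testFolds(6, 3, 1L);
        for (int f = 0; f < folds.size(); f++) {
            System.out.println("fold " + f + ": test=" + folds.get(f)
                    + " train=" + trainIndices(folds, f));
        }
    }
}
```

In each round a classifier would be trained on the training indices and evaluated on the test fold, and the k results averaged. With text data, any dictionary-building step such as StringToWordVector should be fitted on the training indices only.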