So far I have:
NGramTokenizer tokenizer = new NGramTokenizer();
tokenizer.setNGramMinSize(2);
tokenizer.setNGramMaxSize(2);
tokenizer.setDelimiters("[\\w+\\d+]");
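StringToWordVector filter = new StringToWordVector();
filter.setTokenizer(tokenizer); // otherwise the filter keeps its default tokenizer
// customize filter here
filter.setInputFormat(input); // must be called before Filter.useFilter
Instances data = Filter.useFilter(input, filter);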
The API has these two methods for StringToWordVector:
setStemmer(Stemmer value);
setStopwordsHandler(StopwordsHandler value);
I have a text file containing the stopwords and another class that stems words. How do I use a custom stemmer and stopwords filter? Note that I'm taking phrases of size 2, so I can't preprocess and remove all stopwords beforehand.
Update: This worked for me (using Weka developer version 3.7.12)
To use a custom stopwords handler:
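import java.util.HashSet;
import weka.core.stopwords.StopwordsHandler;

public class MyStopwordsHandler implements StopwordsHandler {
    // Initialized up front so isStopword never sees a null set
    private final HashSet<String> myStopwords = new HashSet<String>();

    public MyStopwordsHandler() {
        // Load in your own stopwords, etc.
    }

    // Must implement this method from the StopwordsHandler interface;
    // the return type is the primitive boolean, not Boolean
    @Override
    public boolean isStopword(String word) {
        return myStopwords.contains(word);
    }
}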
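For the "load in your own stopwords" part, here is a minimal sketch that uses only the JDK. `StopwordList` and its methods are hypothetical names, not part of Weka, and the file format (UTF-8, one word per line, `#` for comments) is an assumption about your text file. The `containsAnyWordOf` helper is only relevant if your handler ends up receiving whole two-word phrases rather than single words, which is worth checking for your Weka version when an NGramTokenizer is set.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Hypothetical helper (not a Weka class): a stopword set built from text lines. */
final class StopwordList {
    private final Set<String> words = new HashSet<>();

    /** Builds the list from raw lines; blank lines and '#' comment lines are skipped. */
    static StopwordList fromLines(List<String> lines) {
        StopwordList list = new StopwordList();
        for (String line : lines) {
            String w = line.trim().toLowerCase(Locale.ROOT);
            if (!w.isEmpty() && !w.startsWith("#")) {
                list.words.add(w);
            }
        }
        return list;
    }

    /** Reads a UTF-8 file with one stopword per line (assumed file format). */
    static StopwordList load(Path file) throws IOException {
        return fromLines(Files.readAllLines(file, StandardCharsets.UTF_8));
    }

    /** True if the token, taken as a whole, is in the list (case-insensitive). */
    boolean contains(String token) {
        return words.contains(token.trim().toLowerCase(Locale.ROOT));
    }

    /** True if any whitespace-separated word of the phrase is in the list. */
    boolean containsAnyWordOf(String phrase) {
        for (String part : phrase.trim().split("\\s+")) {
            if (contains(part)) {
                return true;
            }
        }
        return false;
    }
}
```

In the handler above, the constructor could call `StopwordList.load(...)` once and `isStopword` could delegate to `contains` or `containsAnyWordOf`, depending on whether you want to drop a phrase only when the whole token matches or whenever one of its words is a stopword.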
To use a custom stemmer, create a class that implements the Stemmer interface and write the implementations for these methods:
public String stem(String word) { ... }
public String getRevision() { ... }
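As an illustration of what the body of `stem` might look like, here is a toy suffix stripper written against the JDK only. `SuffixStripper`, its suffix list, and the minimum stem length are all made up for this sketch; it is not a real stemming algorithm such as Porter's, and your own stemming class would take its place.

```java
import java.util.Locale;

/** Hypothetical stemming logic: lowercases a word and strips a few English suffixes. */
final class SuffixStripper {
    // Checked in order, so longer suffixes come first
    private static final String[] SUFFIXES = {"ing", "ed", "es", "s"};
    // Do not strip a suffix if fewer than this many characters would remain
    private static final int MIN_STEM_LENGTH = 3;

    static String stem(String word) {
        String w = word.toLowerCase(Locale.ROOT);
        for (String suffix : SUFFIXES) {
            if (w.endsWith(suffix) && w.length() - suffix.length() >= MIN_STEM_LENGTH) {
                return w.substring(0, w.length() - suffix.length());
            }
        }
        return w; // no suffix matched, or the word is too short to strip
    }
}
```

Inside a class that implements the Stemmer interface, `stem(String word)` could then simply return `SuffixStripper.stem(word)`, and `getRevision()` can return a version string of your choice.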
Then to use your custom stopwords handler and stemmer:
StringToWordVector filter = new StringToWordVector();
filter.setStemmer(new MyStemmer());
filter.setStopwordsHandler(new MyStopwordsHandler());
Note: The answer below by Thusitha works for the stable 3.6 version, and it is much simpler than the one described above. But I could not get it to work with the 3.7.12 version.
With the stable Weka release (3.6.x) you can use:
StringToWordVector filter = new StringToWordVector();
filter.setStopwords(new File("filename"));
I'm using the following dependency:
<dependency>
<groupId>nz.ac.waikato.cms.weka</groupId>
<artifactId>weka-stable</artifactId>
<version>3.6.12</version>
</dependency>
From the API docs:
public void setStopwords(java.io.File value)
Sets the file containing the stopwords; null or a directory unsets the stopwords. If the file exists, it automatically turns on the flag to use the stoplist.
Parameters: value - the file containing the stopwords