
How do I use custom stopwords and stemmer file in WEKA (Java)?


So far I have:

NGramTokenizer tokenizer = new NGramTokenizer();
tokenizer.setNGramMinSize(2);
tokenizer.setNGramMaxSize(2); 
tokenizer.setDelimiters(" \r\n\t.,;:'\"()?!"); // a set of delimiter characters, not a regex

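StringToWordVector filter = new StringToWordVector();
filter.setTokenizer(tokenizer);  // use the bigram tokenizer configured above
// customize filter here
filter.setInputFormat(input);    // must be called before Filter.useFilter
Instances data = Filter.useFilter(input, filter);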

The API has these two methods for StringToWordVector:

setStemmer(Stemmer value);
setStopwordsHandler(StopwordsHandler value);

I have a text file containing the stopwords and a separate class that stems words. How do I plug in a custom stemmer and a custom stopwords filter? Note that I'm taking phrases of size 2, so I can't preprocess the text and remove all stopwords beforehand.

Update: This worked for me (using Weka developer version 3.7.12)

To use a custom stopwords handler:

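import java.util.HashSet;
import java.util.Set;
import weka.core.stopwords.StopwordsHandler;

public class MyStopwordsHandler implements StopwordsHandler {

    // Initialized up front so isStopword never hits a null set
    private final Set<String> myStopwords = new HashSet<String>();

    public MyStopwordsHandler() {
        // Load in your own stopwords, etc.
    }

    // Must implement this method from the StopwordsHandler interface
    // (the interface declares a primitive boolean return type)
    @Override
    public boolean isStopword(String word) {
        return myStopwords.contains(word);
    }

}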
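For the "load in your own stopwords" part, here is a minimal sketch of reading them from a text file. The class name FileStopwords and the file layout (one word per line, lines starting with # ignored) are my own assumptions, not something Weka prescribes for custom handlers. It is written without any Weka import so it compiles on its own; adding "implements weka.core.stopwords.StopwordsHandler" (and probably java.io.Serializable, if the filter gets serialized) should be enough to plug it in. As far as I can tell the filter hands the handler each token exactly as the tokenizer produced it, so with a bigram tokenizer the argument would be the whole phrase. The sketch therefore drops a phrase only when every word in it is a stopword; that policy is also an assumption, so change it if you need something stricter.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper: holds stopwords read from a plain-text file.
class FileStopwords {

    private final Set<String> stopwords = new HashSet<>();

    // Assumed file layout: one stopword per line, '#' starts a comment line.
    FileStopwords(Path file) throws IOException {
        for (String line : Files.readAllLines(file, StandardCharsets.UTF_8)) {
            String word = line.trim().toLowerCase(Locale.ROOT);
            if (!word.isEmpty() && !word.startsWith("#")) {
                stopwords.add(word);
            }
        }
    }

    // Same signature as the method in Weka's StopwordsHandler interface.
    // A multi-word token counts as a stopword only if all of its words are listed.
    public boolean isStopword(String token) {
        for (String word : token.trim().toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!stopwords.contains(word)) {
                return false;
            }
        }
        return true;
    }
}
```

With a file containing "the" and "of", this treats "of the" as a stopword phrase but keeps "the cat".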

To use a custom stemmer, create a class that implements the Stemmer interface (weka.core.stemmers.Stemmer) and implement these methods:

public String stem(String word) { ... }
public String getRevision() { ... } 
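To make the two methods concrete, here is a toy sketch. The class name SuffixStemmer and its suffix rules are made up for illustration; this is not Porter or any stemmer that ships with Weka. It is written without Weka imports so it compiles standalone. To use it with the filter you would declare it as "implements weka.core.stemmers.Stemmer". Because I would expect the stemmer to receive whole tokens from the tokenizer, the sketch stems each whitespace-separated word of a phrase separately.

```java
// Hypothetical toy stemmer: strips a few common English suffixes.
class SuffixStemmer {

    // Checked in this order; the first matching suffix wins.
    private static final String[] SUFFIXES = {"ing", "ed", "s"};

    // Same signature as Stemmer.stem; handles single words and phrases.
    public String stem(String token) {
        StringBuilder out = new StringBuilder();
        for (String word : token.trim().split("\\s+")) {
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(stemWord(word));
        }
        return out.toString();
    }

    // Strips one suffix, but only if at least three characters remain.
    private static String stemWord(String word) {
        for (String suffix : SUFFIXES) {
            if (word.endsWith(suffix) && word.length() - suffix.length() >= 3) {
                return word.substring(0, word.length() - suffix.length());
            }
        }
        return word;
    }

    // Same signature as the getRevision method required via the Stemmer interface.
    public String getRevision() {
        return "1.0";
    }
}
```

For example, stem("walking dogs") gives "walk dog", while a short word such as "is" is left unchanged.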

Then to use your custom stopwords handler and stemmer:

StringToWordVector filter = new StringToWordVector();
filter.setStemmer(new MyStemmer());
filter.setStopwordsHandler(new MyStopwordsHandler());

Note: The answer below by Thusitha works for the stable 3.6 version and is much simpler than the approach described above, but I could not get it to work with version 3.7.12.


Solution

  • In the stable Weka library (3.6.x) you can use:

    StringToWordVector filter = new StringToWordVector();
    filter.setStopwords(new File("filename"));
    

    I'm using the following dependency:

    <dependency>
       <groupId>nz.ac.waikato.cms.weka</groupId>
       <artifactId>weka-stable</artifactId>
       <version>3.6.12</version>
    </dependency>
    

    From the API docs:

    public void setStopwords(java.io.File value)

    Sets the file containing the stopwords; null or a directory unsets the stopwords. If the file exists, it automatically turns on the flag to use the stoplist.

    Parameters: value - the file containing the stopwords
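If you don't have the stopwords file yet, a small sketch for generating one follows. I'm assuming the layout I understand Weka 3.6 to read: one word per line, with lines starting with # treated as comments. The class name StopwordFileWriter is hypothetical, and the code uses only the JDK.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper: writes a stopword list in a one-word-per-line layout.
class StopwordFileWriter {

    static Path write(Path target, List<String> words) throws IOException {
        List<String> lines = new ArrayList<>();
        lines.add("# custom stopwords, one word per line");
        for (String word : words) {
            String cleaned = word.trim().toLowerCase(Locale.ROOT);
            if (!cleaned.isEmpty()) {
                lines.add(cleaned); // skip blank entries
            }
        }
        return Files.write(target, lines, StandardCharsets.UTF_8);
    }
}
```

The returned path can then be passed on as filter.setStopwords(path.toFile()).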