Tags: lucene, mahout, tf-idf

EnglishAnalyzer with better stop word filtering?


I'm creating TFIDF vectors using Apache Mahout. I specify EnglishAnalyzer as part of document tokenizing like so:

DocumentProcessor.tokenizeDocuments(documentsSequencePath, EnglishAnalyzer.class, tokenizedDocumentsPath, configuration); 

which gives me the following vector for a document I've called business.txt. I was surprised to see useless words in there like have, on, i, and e.g. One of my other documents has many more.

What is the simplest way for me to improve the quality of the terms it's finding? I know EnglishAnalyzer can be passed a stop word list but the constructor gets invoked via reflection so it seems like I can't do that.

Should I write my own Analyzer? I'm a bit confused about how to compose tokenizers, filters, etc. Can I reuse EnglishAnalyzer along with my own filters? Subclassing EnglishAnalyzer doesn't seem to be possible, since the class is final.

# document: tfidf-score term
business.txt: 109 comput
business.txt: 110 us
business.txt: 111 innov
business.txt: 111 profit
business.txt: 112 market
business.txt: 114 technolog
business.txt: 117 revolut
business.txt: 119 on
business.txt: 119 platform
business.txt: 119 strategi
business.txt: 120 logo
business.txt: 121 i
business.txt: 121 pirat
business.txt: 123 econom
business.txt: 127 creation
business.txt: 127 have
business.txt: 128 peopl
business.txt: 128 compani
business.txt: 134 idea
business.txt: 139 luxuri
business.txt: 139 synergi
business.txt: 140 disrupt
business.txt: 140 your
business.txt: 141 piraci
business.txt: 145 product
business.txt: 147 busi
business.txt: 168 funnel
business.txt: 176 you
business.txt: 186 custom
business.txt: 197 e.g
business.txt: 301 brand

Solution

  • You can pass a custom stop word set to the EnglishAnalyzer ctor. It is typical for this stop word list to be loaded from a file, which is plain text with one stop word per line. That would look something like this:

    // StopwordAnalyzerBase.loadStopwordSet is protected, so outside an
    // Analyzer subclass, load the file with WordlistLoader instead:
    String stopFileLocation = "\\path\\to\\my\\stopwords.txt";
    CharArraySet stopwords;
    try (Reader reader = Files.newBufferedReader(Paths.get(stopFileLocation))) {
        stopwords = WordlistLoader.getWordSet(reader);
    }
    EnglishAnalyzer analyzer = new EnglishAnalyzer(stopwords);
    
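    For reference, stopwords.txt is just plain text with one term per line, so a file covering the noise terms from your output would start like:

        have
        on
        i
        you
        your
        e.g

    Alternatively, here's a sketch (not part of the snippet above) that starts from Lucene's default English stop set and adds the offenders from your output, since CharArraySet.copy returns a mutable copy:

        // needs org.apache.lucene.analysis.CharArraySet,
        // org.apache.lucene.analysis.en.EnglishAnalyzer, and java.util.Arrays
        CharArraySet stopwords = CharArraySet.copy(EnglishAnalyzer.getDefaultStopSet());
        stopwords.addAll(Arrays.asList("have", "i", "you", "your", "e.g"));
        EnglishAnalyzer analyzer = new EnglishAnalyzer(stopwords);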

    I don't see, right off, how you are supposed to pass constructor arguments to the Mahout method you've indicated; I don't really know Mahout. If you aren't able to, then yes, you can create a custom analyzer by copying EnglishAnalyzer and loading your own stopwords there. Here's an example that loads a custom stop word list from a file, with the stem-exclusion logic removed for brevity's sake.

    // Imports assume Lucene 6.x/7.x; some of these classes moved packages between major versions.
    import java.io.IOException;
    import java.nio.file.Paths;

    import org.apache.lucene.analysis.LowerCaseFilter;
    import org.apache.lucene.analysis.StopFilter;
    import org.apache.lucene.analysis.StopwordAnalyzerBase;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.Tokenizer;
    import org.apache.lucene.analysis.en.EnglishPossessiveFilter;
    import org.apache.lucene.analysis.en.PorterStemFilter;
    import org.apache.lucene.analysis.standard.StandardFilter;
    import org.apache.lucene.analysis.standard.StandardTokenizer;

    public final class EnglishAnalyzerCustomStops extends StopwordAnalyzerBase {
      private static final String STOP_FILE_LOCATION = "\\path\\to\\my\\stopwords.txt";

      public EnglishAnalyzerCustomStops() throws IOException {
        // loadStopwordSet is protected, so it is accessible from this subclass
        super(StopwordAnalyzerBase.loadStopwordSet(Paths.get(STOP_FILE_LOCATION)));
      }

      @Override
      protected TokenStreamComponents createComponents(String fieldName) {
        // Same filter chain as EnglishAnalyzer, minus the stem-exclusion filter
        final Tokenizer source = new StandardTokenizer();
        TokenStream result = new StandardFilter(source);
        result = new EnglishPossessiveFilter(result);
        result = new LowerCaseFilter(result);
        result = new StopFilter(result, stopwords); // stopwords is inherited from StopwordAnalyzerBase
        result = new PorterStemFilter(result);
        return new TokenStreamComponents(source, result);
      }

      @Override
      protected TokenStream normalize(String fieldName, TokenStream in) {
        TokenStream result = new StandardFilter(in);
        result = new LowerCaseFilter(result);
        return result;
      }
    }
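
    Once that class is compiled and on Mahout's classpath, you should be able to hand it to the same tokenizing call from your question (assuming, as with EnglishAnalyzer, that the reflection only needs the no-argument constructor):

        DocumentProcessor.tokenizeDocuments(documentsSequencePath, EnglishAnalyzerCustomStops.class, tokenizedDocumentsPath, configuration);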