I'm creating TF-IDF vectors using Apache Mahout. I specify EnglishAnalyzer as part of document tokenizing, like so:
DocumentProcessor.tokenizeDocuments(documentsSequencePath, EnglishAnalyzer.class, tokenizedDocumentsPath, configuration);
which gives me the vector below for a document I've called business.txt. I was surprised to see useless words in there like have, on, i, and e.g. One of my other documents has loads more.
What is the simplest way for me to improve the quality of the terms it's finding? I know EnglishAnalyzer can be passed a stop word list, but the constructor gets invoked via reflection, so it seems I can't do that.
Should I write my own Analyzer? I'm a bit confused about how to compose tokenizers, filters, etc. Can I reuse EnglishAnalyzer along with my own filters? Subclassing it doesn't seem possible, since it's declared final.
# document: tfidf-score term
business.txt: 109 comput
business.txt: 110 us
business.txt: 111 innov
business.txt: 111 profit
business.txt: 112 market
business.txt: 114 technolog
business.txt: 117 revolut
business.txt: 119 on
business.txt: 119 platform
business.txt: 119 strategi
business.txt: 120 logo
business.txt: 121 i
business.txt: 121 pirat
business.txt: 123 econom
business.txt: 127 creation
business.txt: 127 have
business.txt: 128 peopl
business.txt: 128 compani
business.txt: 134 idea
business.txt: 139 luxuri
business.txt: 139 synergi
business.txt: 140 disrupt
business.txt: 140 your
business.txt: 141 piraci
business.txt: 145 product
business.txt: 147 busi
business.txt: 168 funnel
business.txt: 176 you
business.txt: 186 custom
business.txt: 197 e.g
business.txt: 301 brand
You can pass a custom stop word set to the EnglishAnalyzer constructor. It's typical to load that stop word list from a file: plain text, one stop word per line. That would look something like this:
String stopFileLocation = "\\path\\to\\my\\stopwords.txt";
// WordlistLoader.getWordSet is public; StopwordAnalyzerBase.loadStopwordSet is
// protected, so it can only be called from inside an Analyzer subclass.
CharArraySet stopwords = WordlistLoader.getWordSet(
        Files.newBufferedReader(Paths.get(stopFileLocation), StandardCharsets.UTF_8));
EnglishAnalyzer analyzer = new EnglishAnalyzer(stopwords);
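(The stopwords.txt file itself is just one word per line: have, on, i, you, and so on.) To sanity-check the result, here's a quick sketch that runs the analyzer over a throwaway string and prints whatever tokens survive; the field name "text" and the sample sentence are arbitrary:

try (TokenStream ts = analyzer.tokenStream("text", "You have an idea, e.g. a brand")) {
    CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
    ts.reset();
    while (ts.incrementToken()) {
        System.out.println(term);  // stemmed terms, minus your stop words
    }
    ts.end();
}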
I don't see, offhand, how you're supposed to pass constructor arguments to the Mahout method you've indicated; I don't really know Mahout. If you can't, then yes, you could create a custom analyzer by copying EnglishAnalyzer and loading your own stop words there. Here's an example that loads a custom stop word list from a file, with the stem-exclusion handling removed for brevity:
// Imports assume Lucene 6.2+; these classes moved around across Lucene versions.
import java.io.IOException;
import java.nio.file.Paths;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.StopwordAnalyzerBase;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.en.EnglishPossessiveFilter;
import org.apache.lucene.analysis.en.PorterStemFilter;
import org.apache.lucene.analysis.standard.StandardFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;

public final class EnglishAnalyzerCustomStops extends StopwordAnalyzerBase {
    private static final String STOP_FILE_LOCATION = "\\path\\to\\my\\stopwords.txt";

    public EnglishAnalyzerCustomStops() throws IOException {
        // loadStopwordSet is protected, so the subclass can call it directly.
        super(StopwordAnalyzerBase.loadStopwordSet(Paths.get(STOP_FILE_LOCATION)));
    }

    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        // Same filter chain as EnglishAnalyzer, minus the stem-exclusion filter.
        final Tokenizer source = new StandardTokenizer();
        TokenStream result = new StandardFilter(source);
        result = new EnglishPossessiveFilter(result); // strips trailing 's
        result = new LowerCaseFilter(result);
        result = new StopFilter(result, stopwords);   // the custom list
        result = new PorterStemFilter(result);        // "business" -> "busi", etc.
        return new TokenStreamComponents(source, result);
    }

    @Override
    protected TokenStream normalize(String fieldName, TokenStream in) {
        TokenStream result = new StandardFilter(in);
        result = new LowerCaseFilter(result);
        return result;
    }
}
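Then, if I'm reading your call correctly, you'd point Mahout at the new class; since it keeps a no-argument constructor, the reflective instantiation should still work:

DocumentProcessor.tokenizeDocuments(documentsSequencePath, EnglishAnalyzerCustomStops.class,
        tokenizedDocumentsPath, configuration);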