I am using Lucene 4.2 and would like to know how WordNet can be used to expand an input query in this version of Lucene. For example, if my query is
term_1 AND term_2 OR term_3
I would like it to be expanded as
(term_1 OR term_1syn_1 OR term_1syn_2) AND (term_2 OR term_2syn_1) OR (term_3 OR term_3syn_1)
and so on.
I looked at other answers on Stack Overflow for this kind of question, but none of them includes a sample implementation.
Given an input query in the form of a string, how can I expand it using the WordNetQueryParser and SynonymMap classes?
I have already downloaded the WordNet Prolog file, and I know that the _s.pl file contains all the synonyms.
Any sample code would be highly appreciated.
A SynonymFilter lets you apply a SynonymMap inside a simple custom Analyzer.
You can create a custom Analyzer just by overriding Analyzer.createComponents, then pass that same Analyzer to both the IndexWriter and the QueryParser, for indexing and searching respectively.
One thing to consider: your case involves expanding each term into all of its synonyms while keeping the term itself, which means setting includeOrig to true in Builder.add. Both settings have their benefits, so it is worth checking which one actually serves your needs best.
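To make the includeOrig choice concrete, here is a small plain-Java sketch that does not use Lucene at all. It only mimics, at the string level, the kind of expansion asked for in the question; Lucene itself injects synonym tokens during analysis rather than rewriting the query string. The class name, method names, and the tiny synonym table are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

public class QueryExpansionSketch {
    // Hypothetical in-memory synonym table; a real one would be loaded from WordNet.
    static final Map<String, List<String>> SYNONYMS = new HashMap<>();
    static {
        SYNONYMS.put("crimson", Arrays.asList("red", "scarlet"));
        SYNONYMS.put("fast", Arrays.asList("quick"));
    }

    // Expands one term. With includeOrig the original term is kept next to its
    // synonyms; without it the synonyms replace the term.
    static String expandTerm(String term, boolean includeOrig) {
        List<String> synonyms = SYNONYMS.get(term.toLowerCase(Locale.ROOT));
        if (synonyms == null || synonyms.isEmpty()) {
            return term; // no known synonyms: leave the term untouched
        }
        List<String> alternatives = new ArrayList<>();
        if (includeOrig) {
            alternatives.add(term);
        }
        alternatives.addAll(synonyms);
        if (alternatives.size() == 1) {
            return alternatives.get(0);
        }
        return "(" + String.join(" OR ", alternatives) + ")";
    }

    // Expands every whitespace-separated token except the boolean operators.
    static String expandQuery(String query, boolean includeOrig) {
        StringBuilder out = new StringBuilder();
        for (String token : query.trim().split("\\s+")) {
            if (out.length() > 0) {
                out.append(' ');
            }
            boolean isOperator = token.equals("AND") || token.equals("OR") || token.equals("NOT");
            out.append(isOperator ? token : expandTerm(token, includeOrig));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // prints: (crimson OR red OR scarlet) AND (fast OR quick) OR car
        System.out.println(expandQuery("crimson AND fast OR car", true));
        // prints: (red OR scarlet) AND quick OR car
        System.out.println(expandQuery("crimson AND fast OR car", false));
    }
}
```

With includeOrig set to false, a document containing only "crimson" would no longer match the expanded query, which is the main tradeoff to weigh.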
Lucene's Analyzer is designed to be readily extended so that you can define the processing your particular case needs. The Analyzer API documentation provides an example of overriding the createComponents method for a custom Analyzer.
Something like:
@Override
protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
    Tokenizer source = new ClassicTokenizer(Version.LUCENE_40, reader);
    TokenStream filter = new StandardFilter(Version.LUCENE_40, source);
    // Lowercase before synonym matching, so the map only needs lowercase entries.
    filter = new LowerCaseFilter(Version.LUCENE_40, filter);
    // Third argument is ignoreCase; false is fine here because tokens are already lowercased.
    filter = new SynonymFilter(filter, mySynonymMap, false);
    // Add whatever other filters you want to the chain, being mindful of order.
    return new TokenStreamComponents(source, filter);
}
You'll also need to define mySynonymMap, used in the example above, which is a SynonymMap. A SynonymMap should generally be built with SynonymMap.Builder, via its add(CharsRef, CharsRef, boolean) method.
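// true = dedup: identical rules added more than once are stored only once.
SynonymMap.Builder builder = new SynonymMap.Builder(true);
// Arguments are (input, output, includeOrig): "crimson" also yields "red".
builder.add(new CharsRef("crimson"), new CharsRef("red"), true);
// Be sure the boolean last argument (includeOrig) is the one you want. There are significant tradeoffs here.
// Add as many terms as you like here...
SynonymMap mySynonymMap = builder.build();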
There is also a WordnetSynonymParser, if you prefer that. At a glance, it looks like a SynonymMap.Builder designed to read one particular file format, namely the WordNet Prolog synonym file.
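For a feel of what that file format involves, here is a minimal plain-Java sketch, independent of Lucene, that groups words by synset from lines in the WordNet Prolog layout s(synset_id,w_num,'word',ss_type,sense_number,tag_count). Words sharing a synset id are synonyms, and each group could then be fed to SynonymMap.Builder.add pair by pair. The class and method names are hypothetical, the sample lines are made up in that layout rather than copied from the real file, and this is not a description of how Lucene's own parser is implemented.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class WordNetPrologSketch {
    // Groups words by synset id. Malformed or unrelated lines are skipped.
    static Map<String, List<String>> groupBySynset(List<String> lines) {
        Map<String, List<String>> synsets = new LinkedHashMap<>();
        for (String raw : lines) {
            String line = raw.trim();
            int comma = line.indexOf(',');
            if (!line.startsWith("s(") || comma < 0) {
                continue;
            }
            String synsetId = line.substring(2, comma);
            // The word is the single-quoted field; use the outermost quotes so
            // that escaped apostrophes inside the word are kept.
            int start = line.indexOf('\'') + 1;
            int end = line.lastIndexOf('\'');
            if (start <= 0 || end < start) {
                continue;
            }
            // Prolog escapes an apostrophe inside a quoted atom by doubling it.
            String word = line.substring(start, end).replace("''", "'");
            synsets.computeIfAbsent(synsetId, k -> new ArrayList<>()).add(word);
        }
        return synsets;
    }

    public static void main(String[] args) {
        // Made-up sample lines in the s(...) layout.
        List<String> sample = Arrays.asList(
                "s(100000001,1,'car',n,1,0).",
                "s(100000001,2,'automobile',n,1,0).",
                "s(100000002,1,'driver''s seat',n,1,0).");
        for (Map.Entry<String, List<String>> entry : groupBySynset(sample).entrySet()) {
            System.out.println(entry.getKey() + " -> " + entry.getValue());
        }
    }
}
```

Grouping by synset first keeps the later step simple: for each group, every word can be added as an input with each of the other words as output.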