lucene synonym phrase

Can someone assist me with a multi-word synonym problem in Lucene?


Simple synonyms (wordA = wordB) are fine. When the synonym is a phrase (wordA = wordB wordC), matching is hit-or-miss.

I have a simple test case (it's delivered as an Ant project) which illustrates the problem. This test case uses the same files as the other question I posted today, but I'll give the same description here.

Materials

You can download the test case here: mydemo.with.libs.zip (5MB)

That archive includes the Lucene 9.2 libraries which my test uses; if you prefer a copy without the JAR files you can download that from here: mydemo.zip (9KB)

You can run the test case by unzipping the archive into an empty directory and running the Ant command ant stsearch

Input

When indexing the documents, the following synonym list is used (permuted as necessary):

note,notes,notice,notification
subtree,sub tree,sub-tree
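
The question does not show how the list is permuted. One plausible reading, given that the loader further down treats the first term on each line as the word and the rest as its synonyms, is that every term gets a line on which it comes first. The sketch below generates such rotated lines; the class and method names are hypothetical and it uses only the standard library.

```java
import java.util.ArrayList;
import java.util.List;

final class SynonymRotations {
    // Returns one line per term, rotated so that each term appears first once.
    // Assumption: this is what "permuted as necessary" means in the question.
    static List<String> rotations(String line) {
        String[] terms = line.split(",");
        List<String> out = new ArrayList<>();
        for (int start = 0; start < terms.length; start++) {
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < terms.length; i++) {
                if (i > 0) {
                    sb.append(',');
                }
                sb.append(terms[(start + i) % terms.length].trim());
            }
            out.add(sb.toString());
        }
        return out;
    }

    public static void main(String[] args) {
        for (String rotated : rotations("subtree,sub tree,sub-tree")) {
            System.out.println(rotated);
        }
    }
}
```

For the second synonym line this yields three lines, starting with subtree, sub tree and sub-tree respectively.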

I have three documents, each containing a single sentence. The three sentences are:

These release notes describe a document sub tree in a simple way.
This release note describes a document subtree in a simple way.
This release notice describes a document sub-tree in a simple way.

Problem

I believe that any of the following searches should match all three documents:

subtree
sub tree
sub-tree
"document subtree"
"document sub tree"
"document sub-tree"

While the searches for subtree and sub-tree match correctly, the search for sub tree only matches a single document (the one which literally contains sub tree as two words).

The phrase searches are incorrect: "document subtree" and "document sub tree" each match one document, and "document sub-tree" matches two.

If I add a proximity modifier to the phrase searches, like so:

"document subtree"~1
"document sub tree"~1
"document sub-tree"~1

the first and third now match all three records, but "document sub tree"~1 still only matches the one document.

The pairing of a two-word phrase as a synonym of a single word just isn't working.

Here's my analyzer including the synonym map builder:

public class MyAnalyzer extends Analyzer {
   public MyAnalyzer(String synlist) {
      this.synlist = synlist;
   }

   @Override
   protected TokenStreamComponents createComponents(String fieldName) {
      WhitespaceTokenizer src = new WhitespaceTokenizer();
      TokenStream result = new LowerCaseFilter(src);
      if (synlist != null) {
         result = new SynonymGraphFilter(result, getSynonyms(synlist), Boolean.TRUE);
         result = new FlattenGraphFilter(result);
      }
      return new TokenStreamComponents(src, result);
   }

   private static SynonymMap getSynonyms(String synlist) {
      boolean dedup = Boolean.TRUE;
      SynonymMap synMap = null;
      SynonymMap.Builder builder = new SynonymMap.Builder(dedup);
      int cnt = 0;

      try {
         BufferedReader br = new BufferedReader(new FileReader(synlist));
         String line;
         try {
            while ((line = br.readLine()) != null) {
               processLine(builder,line);
               cnt++;
            }
         } catch (IOException e) {
            System.err.println(" caught " + e.getClass() + " while reading synonym list,\n with message " + e.getMessage());
         }
         System.out.println("Synonym load processed " + cnt + " lines");
         br.close();
      } catch (Exception e) {
         System.err.println(" caught " + e.getClass() + " while loading synonym map,\n with message " + e.getMessage());
      }
      if (cnt > 0) {
         try {
            synMap = builder.build();
         } catch (IOException e) {
            System.err.println(e);
         }
      }
      return synMap;
   }

   private static void processLine(SynonymMap.Builder builder, String line) {
      boolean keepOrig = Boolean.TRUE;
      String terms[] = line.split(",");
      if (terms.length < 2) {
         System.err.println("Synonym input must have at least two terms on a line: " + line);
      } else {
         String word = terms[0];
         String[] synonymsOfWord = Arrays.copyOfRange(terms, 1, terms.length);
         addSyns(builder, word, synonymsOfWord, keepOrig);
      }
   }

   private static void addSyns(SynonymMap.Builder builder, String word, String[] syns, boolean keepOrig) {
      CharsRefBuilder synset = new CharsRefBuilder();
      SynonymMap.Builder.join(syns, synset);
      CharsRef wordp = SynonymMap.Builder.join(word.split("\\s+"), new CharsRefBuilder());
      builder.add(wordp, synset.get(), keepOrig);
   }

   private String synlist;
}

I suspect I have to do some additional manipulation of the synonymsOfWord array, but nothing I've tried has worked.
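
One way to see what addSyns hands to the builder is to mimic SynonymMap.Builder.join with plain strings. The sketch below assumes, following the Javadoc, that join only concatenates its arguments with WORD_SEPARATOR between them. JoinSketch and its methods are hypothetical stand-ins, not Lucene classes, and the separator is printed as | to make it visible.

```java
final class JoinSketch {
    // Stand-in for SynonymMap.WORD_SEPARATOR, which is the NUL character.
    static final char WORD_SEPARATOR = '\u0000';

    // Assumed to mirror SynonymMap.Builder.join: concatenate the given words,
    // placing WORD_SEPARATOR between them.
    static String join(String[] words) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < words.length; i++) {
            if (i > 0) {
                sb.append(WORD_SEPARATOR);
            }
            sb.append(words[i]);
        }
        return sb.toString();
    }

    // Replaces the invisible separator with '|' so the result can be printed.
    static String show(String joined) {
        return joined.replace(WORD_SEPARATOR, '|');
    }

    public static void main(String[] args) {
        String[] syns = {"sub tree", "sub-tree"};
        // What addSyns passes as the output: one phrase made of two "words",
        // the first of which still contains a space.
        System.out.println(show(join(syns)));
        // One output per synonym, each split into its own words.
        for (String syn : syns) {
            System.out.println(show(join(syn.split("\\s+"))));
        }
    }
}
```

If that assumption holds, the whole synonym list for a line is registered as a single multi-word output whose first token is the literal string "sub tree", rather than as separate alternatives. That would be consistent with the search for sub tree matching only the document that literally contains those two words. A possible fix within this code would be to call builder.add once per synonym, with each synonym joined from its own words.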

Note that the analyzer includes synonyms when building the index, and not when it is executing a query.


Solution

  • I do not know if this is the best solution, but it is a solution.

    It is basically a very similar approach to the answer to this related question, but with an enhancement to handle synonyms, some of which contain multiple words:

    "subtree", "sub tree", "sub-tree"
    

    In this case, the synonym builder needs to make use of SynonymMap.WORD_SEPARATOR:

    "for multiword support, you must separate words with this separator"

    It's just a char containing the NUL character, \u0000.

    Therefore you can do something quick and dirty as follows:

    // the synonyms as written, with spaces and dashes between words
    String[] synonyms = {"sub tree", "sub-tree", "subtree"};
    int len = synonyms.length;
    String sep = Character.toString(SynonymMap.WORD_SEPARATOR);
    String[] luceneSyns = new String[len];
    for (int i = 0; i < len; i++) {
        // replace() treats its arguments as literal text, not as a regex
        luceneSyns[i] = synonyms[i].replace(" ", sep).replace("-", sep);
    }
    

    And now luceneSyns becomes the array we use:

    // assumed settings: the question's code uses true for both
    boolean dedup = true;
    boolean includeOrig = true;
    // build a synonym map where every word or phrase in the list is a synonym
    // of every other word or phrase in the list:
    SynonymMap.Builder synMapBuilder = new SynonymMap.Builder(dedup);
    for (String word : luceneSyns) {
        for (String synonym : luceneSyns) {
            // skip identical entries ("sub tree" and "sub-tree" are now the same string)
            if (!synonym.equals(word)) {
                //System.out.println(word + " > " + synonym);
                synMapBuilder.add(new CharsRef(word), new CharsRef(synonym), includeOrig);
            }
        }
    }
    

    This works.

    All the queries listed in the question will find all three documents.


    The above approach is not pretty: it assumes you will only ever need to handle a space and a dash as the two characters which need to be replaced by the separator character.
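
A slightly more general variant of the replacement step, as a sketch only: split each entry on any run of characters that are neither letters nor digits, then join the pieces with the separator. PhraseSeparators and toSeparated are hypothetical names, and WORD_SEPARATOR here is a local stand-in for SynonymMap.WORD_SEPARATOR so that the example needs nothing beyond the standard library.

```java
final class PhraseSeparators {
    // Stand-in for SynonymMap.WORD_SEPARATOR, which is the NUL character.
    static final char WORD_SEPARATOR = '\u0000';

    // Splits on any run of non-letter, non-digit characters and joins the
    // remaining words with WORD_SEPARATOR.
    static String toSeparated(String phrase) {
        StringBuilder sb = new StringBuilder();
        for (String word : phrase.split("[^\\p{L}\\p{N}]+")) {
            if (word.isEmpty()) {
                continue; // produced by leading punctuation or whitespace
            }
            if (sb.length() > 0) {
                sb.append(WORD_SEPARATOR);
            }
            sb.append(word);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String[] synonyms = {"sub tree", "sub-tree", "subtree"};
        for (String synonym : synonyms) {
            // print the separator as '|' so it is visible
            System.out.println(toSeparated(synonym).replace(WORD_SEPARATOR, '|'));
        }
    }
}
```

The split rule is an assumption and should agree with whatever tokenizer is used at index time. If the tokenizer keeps sub-tree as a single token, splitting it here would produce a phrase that never matches.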

    Another, more robust approach is probably to use SynonymMap.Parser. Its analyze() method runs the provided synonym text through an analyzer and joins the resulting tokens with WORD_SEPARATOR, which is the form needed for phrase synonyms.

    SynonymMap.Parser is an abstract class, but only parse() is abstract. The analyze() method is already implemented by the base class, so a subclass does not need to override it. I have not tested this approach end to end, but here is as far as I got:

    First I created the class MySynonymParser:

    import java.io.IOException;
    import java.io.Reader;
    import java.text.ParseException;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.synonym.SynonymMap;
    
    public class MySynonymParser extends SynonymMap.Parser {
        
        public MySynonymParser(boolean dedup, Analyzer analyzer) {
            // Parser has no no-argument constructor; it keeps the analyzer
            // that the inherited analyze() method uses
            super(dedup, analyzer);
        }
        
        @Override
        public void parse(Reader reader) throws IOException, ParseException {
            // reading a synonym file is not needed for this example
            throw new UnsupportedOperationException("Not supported yet."); 
        }
        
    }
    

    The inherited analyze() method captures the analyzed output for the provided input string, joins the tokens with the NUL separator, and returns the result as a CharsRef. It throws an IllegalArgumentException if the analyzer removes every token from the input.

    But assuming it's correctly implemented, then I assume it would be used as follows:

    Analyzer analyzer = new Analyzer() {
        @Override
        protected Analyzer.TokenStreamComponents createComponents(String fieldName) {
            Tokenizer source = new StandardTokenizer();
            TokenStream tokenStream = source;
            tokenStream = new LowerCaseFilter(tokenStream);
            tokenStream = new ASCIIFoldingFilter(tokenStream);
            return new Analyzer.TokenStreamComponents(source, tokenStream);
        }
    };
    MySynonymParser mySynonymParser = new MySynonymParser(dedup, analyzer);
    // build a synonym map where every word or phrase in the list is a synonym
    // of every other word or phrase in the list:
    SynonymMap.Builder synMapBuilder2 = new SynonymMap.Builder(dedup);
    // iterate over the original strings; analyze() inserts the separators itself
    for (String word : synonyms) {
        for (String synonym : synonyms) {
            // use a fresh CharsRefBuilder per call: analyze() returns a view of the
            // builder's buffer, so sharing one would overwrite the first result
            CharsRef input = mySynonymParser.analyze(word, new CharsRefBuilder());
            CharsRef output = mySynonymParser.analyze(synonym, new CharsRefBuilder());
            // compare the analyzed forms, since different spellings can analyze alike
            if (!input.equals(output)) {
                synMapBuilder2.add(input, output, includeOrig);
            }
        }
    }
    
    

    In the above code, we have to create an analyzer to pass to MySynonymParser. This analyzer is the same as the one we actually use for indexing, but without the synonym filters.

    Then we analyze each word and each synonym, which joins the resulting tokens with the NUL separator.