
I have synonym matching working EXCEPT in quoted phrases


Simple synonyms (wordA = wordB) are fine. When a word has two or more synonyms (wordA = wordB = wordC ...), phrase matching works only for the first synonym, unless the phrase has a proximity modifier.

I have a simple test case (it's delivered as an Ant project) which illustrates the problem.

Materials

You can download the test case here: mydemo.with.libs.zip (5MB)

That archive includes the Lucene 9.2 libraries which my test uses; if you prefer a copy without the JAR files you can download that from here: mydemo.zip (9KB)

You can run the test case by unzipping the archive into an empty directory and running the Ant command: ant rnsearch

Input

When indexing the documents, the following synonym list is used (permuted as necessary):

note,notes,notice,notification
subtree,sub tree,sub-tree

I have three documents, each containing a single sentence. The three sentences are:

These release notes describe a document sub tree in a simple way.
This release note describes a document subtree in a simple way.
This release notice describes a document sub-tree in a simple way.

Problem

I believe that any of the following searches should match all three documents:

release note
release notes
release notice
release notification
"release note"
"release notes"
"release notice"
"release notification"

As it happens, the first four searches are fine, but the quoted phrases demonstrate a problem. The searches for "release note" and "release notes" match all three records, but "release notice" only matches one, and "release notification" does not match any.

However if I change the last two searches like so:

"release notice"~1
"release notification"~2

then all three documents match.

What appears to be happening is that the first synonym is being given the same index position as the term, the second synonym has the position offset by 1, the third offset by 2, etc.

I believe that all the synonyms should be given the same position so that all four phrases match without the need for proximity modifiers at all.
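
To make the suspected layout concrete, here is a minimal plain-Java sketch of a positional index for one document. It is not Lucene code: the class PhrasePositionSketch and its methods are hypothetical, the token positions are the ones guessed at above rather than values read from a real index, and the slop rule is a simplified in-order version of what a phrase query does.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy positional index for one document; a simplified model, not Lucene code. */
final class PhrasePositionSketch {

    /** Records that a term occurs at the given token position. */
    static void add(Map<String, List<Integer>> index, String term, int position) {
        index.computeIfAbsent(term, k -> new ArrayList<>()).add(position);
    }

    /**
     * Simplified in-order match of a two-term phrase: the second term may
     * trail the first by up to slop extra positions (slop 0 means adjacent).
     */
    static boolean matches(Map<String, List<Integer>> index,
                           String first, String second, int slop) {
        List<Integer> none = Collections.emptyList();
        for (int p1 : index.getOrDefault(first, none)) {
            for (int p2 : index.getOrDefault(second, none)) {
                int gap = p2 - p1 - 1;
                if (gap >= 0 && gap <= slop) {
                    return true;
                }
            }
        }
        return false;
    }

    /** Layout suspected in the question: each later synonym is shifted by one. */
    static Map<String, List<Integer>> staggered() {
        Map<String, List<Integer>> index = new HashMap<>();
        add(index, "release", 1);
        add(index, "note", 2);
        add(index, "notes", 2);
        add(index, "notice", 3);
        add(index, "notification", 4);
        return index;
    }

    /** Desired layout: every synonym shares the original term's position. */
    static Map<String, List<Integer>> stacked() {
        Map<String, List<Integer>> index = new HashMap<>();
        add(index, "release", 1);
        for (String term : new String[] {"note", "notes", "notice", "notification"}) {
            add(index, term, 2);
        }
        return index;
    }

    public static void main(String[] args) {
        // Compare exact (slop 0) phrase matching under both layouts.
        for (String s : new String[] {"note", "notes", "notice", "notification"}) {
            System.out.println("\"release " + s + "\""
                    + " staggered=" + matches(staggered(), "release", s, 0)
                    + " stacked=" + matches(stacked(), "release", s, 0));
        }
    }
}
```

Under the staggered layout, this model needs a slop of 1 for "release notice" and 2 for "release notification", which is the same pattern as the ~1 and ~2 modifiers above. Under the stacked layout, all four phrases match with no slop.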

Edit: here's the source of my analyzer:

public class MyAnalyzer extends Analyzer {
   public MyAnalyzer(String synlist) {
      this.synlist = synlist;
   }

   @Override
   protected TokenStreamComponents createComponents(String fieldName) {
      WhitespaceTokenizer src = new WhitespaceTokenizer();
      TokenStream result = new LowerCaseFilter(src);
      if (synlist != null) {
         result = new SynonymGraphFilter(result, getSynonyms(synlist), Boolean.TRUE);
         result = new FlattenGraphFilter(result);
      }
      return new TokenStreamComponents(src, result);
   }

   private static SynonymMap getSynonyms(String synlist) {
      boolean dedup = Boolean.TRUE;
      SynonymMap synMap = null;
      SynonymMap.Builder builder = new SynonymMap.Builder(dedup);
      int cnt = 0;

      try {
         BufferedReader br = new BufferedReader(new FileReader(synlist));
         String line;
         try {
            while ((line = br.readLine()) != null) {
               processLine(builder,line);
               cnt++;
            }
         } catch (IOException e) {
            System.err.println(" caught " + e.getClass() + " while reading synonym list,\n with message " + e.getMessage());
         }
         System.out.println("Synonym load processed " + cnt + " lines");
         br.close();
      } catch (Exception e) {
         System.err.println(" caught " + e.getClass() + " while loading synonym map,\n with message " + e.getMessage());
      }
      if (cnt > 0) {
         try {
            synMap = builder.build();
         } catch (IOException e) {
            System.err.println(e);
         }
      }
      return synMap;
   }

   private static void processLine(SynonymMap.Builder builder, String line) {
      boolean keepOrig = Boolean.TRUE;
      String terms[] = line.split(",");
      if (terms.length < 2) {
         System.err.println("Synonym input must have at least two terms on a line: " + line);
      } else {
         String word = terms[0];
         String[] synonymsOfWord = Arrays.copyOfRange(terms, 1, terms.length);
         addSyns(builder, word, synonymsOfWord, keepOrig);
      }
   }

   private static void addSyns(SynonymMap.Builder builder, String word, String[] syns, boolean keepOrig) {
      CharsRefBuilder synset = new CharsRefBuilder();
      SynonymMap.Builder.join(syns, synset);
      CharsRef wordp = SynonymMap.Builder.join(word.split("\\s+"), new CharsRefBuilder());
      builder.add(wordp, synset.get(), keepOrig);
   }

   private String synlist;
}

The analyzer includes synonyms when it builds the index, and does not add synonyms when it is used to process a query.


Solution

  • For the "note", "notes", "notice", "notification" list of synonyms:

    It is possible to index the above synonyms so that every query listed in the question finds all three documents, including the phrase searches, with no need for any ~n proximity modifiers.

    I see there is a separate question for the other list of synonyms ("subtree", "sub tree", "sub-tree"), so I will skip those here. I expect the approach below will not work for those, but I would have to take a closer look.


    The solution is straightforward. It is based on the realization that an assumption I made in an earlier question, about how to build the synonyms, was completely incorrect:

    You can place multiple synonyms of a given word at the same position as the word when building your indexed data. I incorrectly thought you needed to provide the synonyms as a list, but you can provide them one at a time, as single words. Joining them with SynonymMap.Builder.join, as the question's code does, produces one multi-word synonym, so its words land on consecutive positions.


    Here is the approach:

    My analyzer:

    Analyzer analyzer = new Analyzer() {
        @Override
        protected Analyzer.TokenStreamComponents createComponents(String fieldName) {
            Tokenizer source = new StandardTokenizer();
            TokenStream tokenStream = source;
            tokenStream = new LowerCaseFilter(tokenStream);
            tokenStream = new ASCIIFoldingFilter(tokenStream);
            // ignoreSynonymCase is a boolean defined elsewhere (not shown here)
            tokenStream = new SynonymGraphFilter(tokenStream, getSynonyms(), ignoreSynonymCase);
            tokenStream = new FlattenGraphFilter(tokenStream);
            return new Analyzer.TokenStreamComponents(source, tokenStream);
        }
    };
    

    The getSynonyms() method used by the above analyzer, using the note,notes,notice,notification list:

    private SynonymMap getSynonyms() {
        // de-duplicate rules when loading:
        boolean dedup = Boolean.TRUE;
        // include original word in index:
        boolean includeOrig = Boolean.TRUE;
    
        String[] synonyms = {"note", "notes", "notice", "notification"};
    
        // build a synonym map where every word in the list is a synonym
        // of every other word in the list:
        SynonymMap.Builder synMapBuilder = new SynonymMap.Builder(dedup);        
        for (String word : synonyms) {
            for (String synonym : synonyms) {
                if (!synonym.equals(word)) {
                    synMapBuilder.add(new CharsRef(word), new CharsRef(synonym), includeOrig);
                }
            }
        }
    
        SynonymMap synonymMap = null;
        try {
            synonymMap = synMapBuilder.build();
        } catch (IOException ex) {
            System.err.print(ex);
        }
        return synonymMap;
    }
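
If the synonyms come from a file of comma-separated lines, as in the question, the same nested loop can be driven from each line. The following is a minimal plain-Java sketch of that expansion only. The class SynonymRuleSketch and its expand method are hypothetical names, it assumes every term is a single word (so it does not cover "sub tree"), and it leaves out the Lucene call. In a real analyzer, each pair would be passed to synMapBuilder.add(new CharsRef(word), new CharsRef(synonym), includeOrig) as in the method above.

```java
import java.util.ArrayList;
import java.util.List;

/** Expands one synonym line into pairwise rules; hypothetical helper, no Lucene. */
final class SynonymRuleSketch {

    /** Returns every ordered (word, synonym) pair from one comma-separated line. */
    static List<String[]> expand(String line) {
        // Split on commas and drop empty entries.
        List<String> terms = new ArrayList<>();
        for (String raw : line.split(",")) {
            String term = raw.trim();
            if (!term.isEmpty()) {
                terms.add(term);
            }
        }
        // Pair each term with every other term, in both directions.
        List<String[]> rules = new ArrayList<>();
        for (String word : terms) {
            for (String synonym : terms) {
                if (!synonym.equals(word)) {
                    rules.add(new String[] {word, synonym});
                }
            }
        }
        return rules;
    }

    public static void main(String[] args) {
        for (String[] rule : expand("note,notes,notice,notification")) {
            System.out.println(rule[0] + " -> " + rule[1]);
        }
    }
}
```

For a line of n distinct single-word terms this yields n × (n − 1) rules, so the four-word list above produces twelve. A line with fewer than two terms produces none.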
    

    I looked at the indexed data by using org.apache.lucene.codecs.simpletext.SimpleTextCodec, to generate human-readable indexes (just for testing purposes):

    IndexWriterConfig iwc = new IndexWriterConfig(analyzer);
    iwc.setOpenMode(OpenMode.CREATE);
    iwc.setCodec(new SimpleTextCodec());
    

    This allowed me to see where the synonyms were inserted into the indexed data. So, for example, taking the word note, we see the following indexed entries:

      term note
        doc 0
          freq 1
          pos 2
        doc 1
          freq 1
          pos 2
        doc 2
          freq 1
          pos 2
    

    This tells us that all three documents contain note at token position 2 (positions are zero-based, so this is the third word).

    And for notification we see exactly the same data:

      term notification
        doc 0
          freq 1
          pos 2
        doc 1
          freq 1
          pos 2
        doc 2
          freq 1
          pos 2
    

    We see this for all the words in the synonym list, which is why all 8 queries return all 3 documents.