
Apache Lucene fuzzy search for multi-worded phrases


I have the following Apache Lucene 7 application:

StandardAnalyzer standardAnalyzer = new StandardAnalyzer();
Directory directory = new RAMDirectory();
IndexWriterConfig config = new IndexWriterConfig(standardAnalyzer);
IndexWriter writer = new IndexWriter(directory, config);
Document document = new Document();

document.add(new TextField("content", new FileReader("document.txt"))); 
writer.addDocument(document);
writer.close();

IndexReader reader = DirectoryReader.open(directory);
IndexSearcher searcher = new IndexSearcher(reader);

Query fuzzyQuery = new FuzzyQuery(new Term("content", "Company"), 2);

TopDocs results = searcher.search(fuzzyQuery, 5);
System.out.println("Hits: " + results.totalHits);
System.out.println("Max score:" + results.getMaxScore());

When I run it with:

new FuzzyQuery(new Term("content", "Company"), 2);

the application works fine and returns the following result:

Hits: 1
Max score:0.35161147

but when I try to search for a multi-word phrase, for example:

new FuzzyQuery(new Term("content", "Company name"), 2);

it returns the following result:

Hits: 0
Max score:NaN

However, the phrase Company name does exist in the source document.txt file.

How do I use FuzzyQuery properly in this case, so that I can do a fuzzy search for multi-word phrases?

UPDATED

Based on the provided solution I have tested it on the following text information:

Company name: BlueCross BlueShield              Customer Service 
   1-800-521-2227           
                        of Texas                          Preauth-Medical              1-800-441-9188           
                                                          Preauth-MH/CD                1-800-528-7264           
                                                          Blue Card Access             1-800-810-2583     

For the following query:

SpanQuery[] clauses = new SpanQuery[2];
clauses[0] = new SpanMultiTermQueryWrapper<FuzzyQuery>(new FuzzyQuery(new Term("content", "BlueCross"), 2));
clauses[1] = new SpanMultiTermQueryWrapper<FuzzyQuery>(new FuzzyQuery(new Term("content", "BlueShield"), 2));
SpanNearQuery query = new SpanNearQuery(clauses, 0, true);

the search works fine:

Hits: 1
Max score:0.5753642

but when I corrupt the search query slightly (for example, changing BlueCross to BlueCros):

SpanQuery[] clauses = new SpanQuery[2];
clauses[0] = new SpanMultiTermQueryWrapper<FuzzyQuery>(new FuzzyQuery(new Term("content", "BlueCros"), 2));
clauses[1] = new SpanMultiTermQueryWrapper<FuzzyQuery>(new FuzzyQuery(new Term("content", "BlueShield"), 2));
SpanNearQuery query = new SpanNearQuery(clauses, 0, true);

it stops working and returns:

Hits: 0
Max score:NaN

Solution

  • The problem is that you are using TextField, which is a tokenized field. For example, the text "Company name is working on something" is split on whitespace and other delimiters. So even though the source contains the phrase Company name, it is indexed as separate terms (company, name, and so on), not as one phrase.

    A FuzzyQuery compares its single term against individual indexed terms, so a term that contains a space, such as "Company name", can never match. A workaround is to build one fuzzy clause per word and combine the clauses with a span query:

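    // One fuzzy clause per word, each allowing up to 2 edits
    SpanQuery[] clauses = new SpanQuery[2];
    clauses[0] = new SpanMultiTermQueryWrapper<FuzzyQuery>(new FuzzyQuery(new Term("content", "some"), 2));
    clauses[1] = new SpanMultiTermQueryWrapper<FuzzyQuery>(new FuzzyQuery(new Term("content", "text"), 2));
    // slop 0 and inOrder true: the words must be adjacent and in this order
    SpanNearQuery query = new SpanNearQuery(clauses, 0, true);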
    

    However, I wouldn't recommend this approach in general, especially under heavy load or if you plan to search for company names that are ten terms long. Be aware that these queries can be expensive to execute, since each fuzzy clause has to be expanded into the indexed terms that fall within the allowed edit distance.

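    The problem with BlueCros is a different one. Your IndexWriterConfig uses StandardAnalyzer, which lowercases terms, so BlueCross in the content field is indexed as bluecross. The term passed to FuzzyQuery is not analyzed, so it is compared as written, including its case.

    The edit distance between BlueCros and bluecross is 3 (two case changes plus the missing s), which exceeds the maximum of 2 edits, so there is no match. The unmodified BlueCross still matched because it is exactly 2 edits away from bluecross.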

    A simple fix is to lowercase the term in the query, for example with .toLowerCase().

    In general, you should apply the same analyzer at query time (that is, when constructing the query) as you used at index time.
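
To make the arithmetic above concrete, here is a minimal sketch that uses only the Java standard library (no Lucene dependency). The class FuzzyMatchDemo and its helpers normalize and levenshtein are hypothetical names introduced for illustration. normalize is only a rough stand-in for an analyzer: it splits on non-alphanumeric characters and lowercases, whereas the real StandardAnalyzer follows Unicode segmentation rules and may also drop stop words, depending on the version. levenshtein is the plain edit distance; FuzzyQuery by default also counts a swap of two adjacent characters as a single edit, which makes no difference for the strings used here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class FuzzyMatchDemo {

    // Hypothetical helper: approximates analysis by splitting on
    // non-letter/non-digit characters and lowercasing each token.
    static List<String> normalize(String phrase) {
        List<String> terms = new ArrayList<>();
        for (String token : phrase.split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                terms.add(token.toLowerCase(Locale.ROOT));
            }
        }
        return terms;
    }

    // Plain, case-sensitive Levenshtein distance
    // (insertions, deletions, substitutions), two-row variant.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        String indexed = "bluecross"; // what the index holds after lowercasing

        System.out.println(levenshtein("BlueCross", indexed)); // 2: within the limit
        System.out.println(levenshtein("BlueCros", indexed));  // 3: over the limit
        System.out.println(levenshtein("bluecros", indexed));  // 1: within the limit

        // Terms that would be passed to the fuzzy clauses after normalization
        System.out.println(normalize("BlueCros BlueShield")); // [bluecros, blueshield]
    }
}
```

Each normalized term is what you would pass to new Term("content", ...) when building the span clauses. In a real application, running the query text through the same Analyzer instance that built the index is more robust than reimplementing its rules by hand.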