Tags: search, lucene, full-text-search, fuzzy-search

Apache Lucene and fuzzy search in text document by list of candidate words


I have a text file produced by an OCR process. The text often contains corrupted words because of the poor image quality of the original documents.

I also have a list of valid company names that can occur in this text file.

Using this list, I'd like to determine which company owns the scanned document (even if the company name is slightly corrupted in the text file).

I'd like to run something like a fuzzy search over the scanned document to find a company name from the list. The winner will be the company name with the highest matching score.

I think I can use Apache Lucene for this purpose. Could you please tell me whether this is possible with Apache Lucene and, if so, show an example?


Solution

  • The proposed idea is the following. You could create a Lucene document for each company name (possibly with a description and any other useful information as well):

    // One Lucene document per known company name.
    Document doc = new Document();
    doc.add(new TextField("text", "BlueCross BlueShield", Field.Store.YES));
    writer.addDocument(doc);
    

    After adding all the companies, you can use the OCR text as a MoreLikeThis (MLT) query. The idea behind MLT is that it tries to find documents similar to a given text.

    It can be created as follows:

    MoreLikeThis mlt = new MoreLikeThis(reader);
    mlt.setAnalyzer(analyzer);
    // Set the minimum frequency and word-length thresholds to 0 so that rare or short terms are kept.
    mlt.setMinDocFreq(0);
    mlt.setMinTermFreq(0);
    mlt.setMinWordLen(0);
    final Query query = mlt.like("text", new StringReader("BlueCros BlueShield              Customer Service \n" +
            "   1-800-521-2227           \n" +
            "                        of Texas                          Preauth-Medical              1-800-441-9188           \n" +
            "                                                          Preauth-MH/CD                1-800-528-7264           \n" +
            "                                                          Blue Card Access             1-800-810-2583           "));
    System.out.println(query);

    TopDocs results = searcher.search(query, 5);
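
The MoreLikeThis step treats the scanned text as the query and the company names as the documents. As a rough illustration of that direction of matching, here is a minimal plain-Java sketch that scores each company name by the fraction of its tokens found in the scanned text. This is only a simplified stand-in, not Lucene's MoreLikeThis scoring; the class and method names (`ReverseMatchSketch`, `tokens`, `score`, `bestMatch`) are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration only: exact token overlap, not Lucene scoring.
class ReverseMatchSketch {

    // Lowercase the text and split it on anything that is not a letter or digit.
    static Set<String> tokens(String text) {
        Set<String> result = new HashSet<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                result.add(t);
            }
        }
        return result;
    }

    // Fraction of the company name's tokens that occur in the scanned text.
    static double score(String companyName, Set<String> textTokens) {
        Set<String> nameTokens = tokens(companyName);
        if (nameTokens.isEmpty()) {
            return 0.0;
        }
        int hits = 0;
        for (String t : nameTokens) {
            if (textTokens.contains(t)) {
                hits++;
            }
        }
        return (double) hits / nameTokens.size();
    }

    // Returns the best-scoring company name, or null when nothing matched.
    static String bestMatch(List<String> companies, String scannedText) {
        Set<String> textTokens = tokens(scannedText);
        String best = null;
        double bestScore = 0.0;
        for (String c : companies) {
            double s = score(c, textTokens);
            if (s > bestScore) {
                bestScore = s;
                best = c;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<String> companies = Arrays.asList("BlueCross BlueShield", "Aetna");
        String scanned = "BlueCros BlueShield Customer Service 1-800-521-2227 of Texas";
        System.out.println(bestMatch(companies, scanned)); // prints "BlueCross BlueShield"
    }
}
```

In this sketch the corrupted token "BlueCros" does not match "BlueCross" exactly, so the name wins only through "BlueShield" (score 0.5). That gap is what fuzzy matching is meant to close.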
    

    Overall, we are doing reverse matching: the scanned text is the query and the company names are the indexed documents. It should help you; I ran some tests with it. The tricky part is the fuzzy match, since MLT does not provide it. In this case, one could rewrite the MLT query so that each term is wrapped in a FuzzyQuery.

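    // Replace every TermQuery produced by MLT with a FuzzyQuery (up to 2 edits).
    BooleanQuery.Builder builder = new BooleanQuery.Builder();
    if (query instanceof BooleanQuery) {
        final List<BooleanClause> clauses = ((BooleanQuery) query).clauses();
        for (BooleanClause bc : clauses) {
            Query q = bc.getQuery();
            if (q instanceof TermQuery) {
                builder.add(new FuzzyQuery(((TermQuery) q).getTerm(), 2), bc.getOccur());
            } else {
                builder.add(bc);
            }
        }
    }
    // Build the rewritten query and search with it instead of the original one.
    Query fuzzyQuery = builder.build();
    TopDocs fuzzyResults = searcher.search(fuzzyQuery, 5);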
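
The `2` passed to `FuzzyQuery` above is the maximum number of edits allowed between the query term and an indexed term. To make that concrete, here is a small plain-Java sketch of the classic Levenshtein distance. It is an illustration of the concept, not Lucene's implementation (Lucene uses automata, and by default `FuzzyQuery` also counts a transposition of two adjacent characters as a single edit, which this sketch does not). The names `EditDistanceSketch`, `levenshtein`, and `withinMaxEdits` are hypothetical.

```java
// Hypothetical illustration only: plain Levenshtein distance, no transpositions.
class EditDistanceSketch {

    // Classic dynamic-programming Levenshtein distance using two rolling rows.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // distance to the empty prefix of b
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertOrDelete = Math.min(curr[j - 1] + 1, prev[j] + 1);
                curr[j] = Math.min(insertOrDelete, prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // True when the two terms are at most maxEdits apart.
    static boolean withinMaxEdits(String a, String b, int maxEdits) {
        return levenshtein(a, b) <= maxEdits;
    }

    public static void main(String[] args) {
        System.out.println(levenshtein("bluecros", "bluecross"));       // prints 1
        System.out.println(withinMaxEdits("bluecros", "bluecross", 2)); // prints true
        System.out.println(withinMaxEdits("aetna", "bluecross", 2));    // prints false
    }
}
```

With a limit of 2 edits, the OCR-damaged "bluecros" still matches "bluecross", while unrelated names stay out of the result.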
    

    It is also very important to use a proper Analyzer. For the simple BlueCross case, I've provided one that splits tokens on a change of letter case. Adding synonyms there might also help.
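
To show what splitting on a case change does to the tokens, here is a minimal plain-Java sketch. It is not the analyzer from the answer; Lucene ships filters for this kind of splitting (for example `WordDelimiterGraphFilter` with its `SPLIT_ON_CASE_CHANGE` flag), and this regex-based version only mimics the effect for simple ASCII input. The names `CaseChangeSplitSketch` and `split` are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration only: regex-based splitting, not a Lucene Analyzer.
class CaseChangeSplitSketch {

    // Split on non-alphanumeric characters, then on lower-to-upper case changes,
    // and lowercase every resulting part.
    static List<String> split(String text) {
        List<String> out = new ArrayList<>();
        for (String word : text.split("[^A-Za-z0-9]+")) {
            if (word.isEmpty()) {
                continue;
            }
            // Zero-width split point between a lowercase and an uppercase letter.
            for (String part : word.split("(?<=[a-z])(?=[A-Z])")) {
                out.add(part.toLowerCase(Locale.ROOT));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(split("BlueCross BlueShield")); // prints [blue, cross, blue, shield]
        System.out.println(split("BlueCros BlueShield"));  // prints [blue, cros, blue, shield]
    }
}
```

After this kind of splitting, the corrupted "BlueCros" shares the exact tokens "blue" and "shield" with the indexed name, and only "cros" needs a fuzzy match against "cross".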

    The full code example is located here