Lucene: Search and retrieve based on relevance


I am using Lucene for indexing and searching; the code I use is below. With the current code the results come back sorted rather than ranked by relevance. I want them ordered by relevance: if I search for something like "a b c", I want the documents that match "a b c" first, then those that match "a b" or "b c", and finally those that match only "a", "b", or "c".

    public void createIndex(final Set<PropertyMetaData> propertyMetaData) throws Exception {

        try {
            final Directory newIndex = new RAMDirectory();
            IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_36, wrapper);
            final IndexWriter writer = new IndexWriter(newIndex, config);
            for (final PropertyMetaData propMetaData : propertyMetaData) {
                addDoc(writer, propMetaData.getPropertyKey(),
                        propMetaData.getDisplayName(),
                        propMetaData.getPropertyID(),
                        propMetaData.getMainTab(), propMetaData.getSubTab());
            }
            writer.commit();
            writer.close();
            final IndexReader newReader = IndexReader.open(newIndex);
            final IndexSearcher newSearcher = new IndexSearcher(newReader);

            final Directory newSpellIndex = new RAMDirectory();
            final SpellChecker newSpellChecker = new SpellChecker(newSpellIndex);
            for (final String field : dictionaryFields) {
                final Dictionary dictionary = new LuceneDictionary(newReader, field);
                config = new IndexWriterConfig(Version.LUCENE_36, wrapper);
                newSpellChecker.indexDictionary(dictionary, config, false);
            }
            logger.info("New indexes for data and dictionary are created.");

            synchronized (this) {
                while (activeSearches != 0) {
                    this.wait();
                }

                // Close all the old resources
                if (searcher != null)
                    searcher.close();
                if (reader != null)
                    reader.close();
                if (index != null)
                    index.close();
                if (spellIndex != null)
                    spellIndex.close();

                index = newIndex;
                reader = newReader;
                searcher = newSearcher;
                spellIndex = newSpellIndex;
                spellChecker = newSpellChecker;
                indexCreated = true;
                logger.info("Start using new index for new searches.");
            }
        } catch (final Exception e) {
            synchronized (this) {
                indexCreated = false;
            }
            throw (e);
        }
    }


    public Set<PropertyMetaData> search(String searchString)
            throws ParseException, CorruptIndexException, IOException {
        final Set<PropertyMetaData> propertyMetaData = new HashSet<PropertyMetaData>();

        if (searchString == null || searchString.trim().isEmpty())
            return propertyMetaData;

        synchronized (this) {
            if (!indexCreated) {
                throw new CorruptIndexException("Warning: Search is called " +
                        "before data index is initialized.");
            }
            activeSearches++;
        }

        try {
            // StopWatch stopWatch = new StopWatch(true);
            final TopScoreDocCollector collector = TopScoreDocCollector.create(
                    hitsPerPage, true);
            // stopWatch.stop();
            // System.out.println("Collector init = " + stopWatch.readMiliseconds());

            BooleanQuery finalQuery = new BooleanQuery();
            final QueryParser parser = new MultiFieldQueryParser(Version.LUCENE_36,
                    searchfields, wrapper);

            searchString = searchString.trim();
            finalQuery.add(parser.parse(searchString),
                    BooleanClause.Occur.SHOULD);
            logger.info("Original Lucene query: " + finalQuery.toString());
            finalQuery = (BooleanQuery) widenQuery(finalQuery);
            logger.info("Lucene query to be executed - " + finalQuery.toString());
            // stopWatch = new StopWatch(true);
            searcher.search(finalQuery, collector);
            // stopWatch.stop();
            // System.out.println("Search time taken = " + stopWatch.readMiliseconds());
            // stopWatch = new StopWatch(true);
            final ScoreDoc[] hits = collector.topDocs().scoreDocs;
            logger.info("No. of hits - " + hits.length);
            // stopWatch.stop();
            // System.out.println("Hit score docs array = " + stopWatch.readMiliseconds());
            // stopWatch = new StopWatch(true);
            for (final ScoreDoc hit : hits) {
                final int docId = hit.doc;
                final Document d = searcher.doc(docId);
                final PropertyMetaData propMetaData = new PropertyMetaData(
                        d.get(Fields.SubTab.toString()),
                        d.get(Fields.MainTab.toString()),
                        d.get(Fields.title.toString()),
                        d.get(Fields.key.toString()),
                        d.get(Fields.MyIdentifierKey.toString()));
                propertyMetaData.add(propMetaData);
            }
            // stopWatch.stop();
            // System.out.println("Prepare property meta data set = " + stopWatch.readMiliseconds());
        } finally {
            // Always release the search slot, even if parsing or searching
            // throws; otherwise createIndex() would wait forever.
            synchronized (this) {
                activeSearches--;
                this.notifyAll();
            }
        }
        return propertyMetaData;
    }

How can I retrieve the results ordered by relevance when searching on multiple words?


Solution

  • By default, Lucene sorts results by text relevance only. Several factors contribute to the relevance score.
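One thing worth checking before tuning any scores, assuming the search() method shown in the question is the code that actually runs: it copies the hits into a HashSet, and HashSet makes no promise about iteration order, so the ranking produced by the collector may be lost before the caller ever sees it. The following is a minimal, stdlib-only sketch of that effect. The class name and the document labels are made up for illustration and are not part of the original code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical demo class: shows what happens to a ranked list when it is
// copied through different Set implementations.
final class RankedOrderDemo {
    private RankedOrderDemo() {}

    /** Adds the ranked items to the given set in order, then reads them back. */
    static List<String> copyThrough(Set<String> target, List<String> ranked) {
        for (String item : ranked) {
            target.add(item);
        }
        return new ArrayList<String>(target);
    }

    public static void main(String[] args) {
        // Pretend these are hits, best score first (made-up labels).
        List<String> ranked = Arrays.asList("doc-zeta", "doc-alpha", "doc-mid");

        // HashSet iterates in hash-bucket order, unrelated to insertion order.
        System.out.println("HashSet:       "
                + copyThrough(new HashSet<String>(), ranked));

        // LinkedHashSet iterates in insertion order, so the ranking survives.
        System.out.println("LinkedHashSet: "
                + copyThrough(new LinkedHashSet<String>(), ranked));
    }
}
```

If this turns out to be the cause, returning a List, or building the result in a LinkedHashSet, would probably be enough to keep the hits in the order Lucene scored them.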

    It is possible that tf-idf values and length normalization have affected your scores, so that documents containing only "a b" or "b c" rank higher than documents containing "a b c".

    One way to address this is to boost the relevance score according to the number of matching query terms. You may follow the steps below.

    1) Write a custom Similarity class that extends DefaultSimilarity. Similarity is the class Lucene uses to hold the formulas for all the factors that contribute to the score.

    Tutorial: Lucene Scoring

    2) Override DefaultSimilarity.coord()

    The Lucene documentation explains coord() as follows:

    coord(q,d) is a score factor based on how many of the query terms are found in the specified document. Typically, a document that contains more of the query's terms will receive a higher score than another document with fewer query terms. This is a search time factor computed in coord(q,d) by the Similarity in effect at search time. 
    

    3) The default implementation of coord() is overlap / maxOverlap, computed as a float. You may experiment with different formulas so that documents containing more of the query words show up in the top results. The following formulas might be good starting points; cast to float before dividing, since both arguments are ints and integer division would truncate.

       1) coord return value = (float) Math.sqrt(overlap / (float) maxOverlap)
       2) coord return value = overlap
    

    4) You do NOT have to override any other methods, since DefaultSimilarity already provides implementations for all scoring factors. Override only the one you want to experiment with, which is coord() in your case. If you extend Similarity directly, you have to provide all the implementations yourself.

    5) Pass the custom Similarity to the IndexSearcher with IndexSearcher.setSimilarity(). Since coord() is a search-time factor, changing it does not require reindexing.
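To get a feel for the candidate formulas before wiring anything into Lucene, the arithmetic can be tried on its own. The sketch below is stdlib-only and is not Lucene code: CoordFormulas, its method names, and TermCountSimilarity in the comment are hypothetical names chosen for illustration, and squaredCoord is an extra variant that is not one of the suggestions above. The comment at the top shows roughly where such a formula would go in Lucene 3.6, assuming the coord(int overlap, int maxOverlap) signature of DefaultSimilarity.

```java
// Rough shape of the Lucene 3.6 hook (not compiled here; class name is made up):
//
//   public class TermCountSimilarity extends DefaultSimilarity {
//       @Override
//       public float coord(int overlap, int maxOverlap) {
//           return CoordFormulas.squaredCoord(overlap, maxOverlap);
//       }
//   }
//   searcher.setSimilarity(new TermCountSimilarity());

// Hypothetical helper class holding candidate coord formulas as plain functions.
final class CoordFormulas {
    private CoordFormulas() {}

    /** Same arithmetic as the default described above: overlap / maxOverlap. */
    static float defaultCoord(int overlap, int maxOverlap) {
        return overlap / (float) maxOverlap;
    }

    /** Square root of the default ratio; the cast avoids integer division. */
    static float sqrtCoord(int overlap, int maxOverlap) {
        return (float) Math.sqrt(overlap / (float) maxOverlap);
    }

    /** Raw number of matching terms. */
    static float overlapCoord(int overlap, int maxOverlap) {
        return overlap;
    }

    /** Extra variant (an assumption, not from the steps above): squared ratio. */
    static float squaredCoord(int overlap, int maxOverlap) {
        float ratio = overlap / (float) maxOverlap;
        return ratio * ratio;
    }

    public static void main(String[] args) {
        int maxOverlap = 3; // e.g. a three-term query such as "a b c"
        for (int overlap = 1; overlap <= maxOverlap; overlap++) {
            System.out.printf(
                    "overlap=%d default=%.3f sqrt=%.3f raw=%.1f squared=%.3f%n",
                    overlap,
                    defaultCoord(overlap, maxOverlap),
                    sqrtCoord(overlap, maxOverlap),
                    overlapCoord(overlap, maxOverlap),
                    squaredCoord(overlap, maxOverlap));
        }
    }
}
```

Comparing the factor for a full match against a partial one is the useful part. When maxOverlap is the same for every document, the raw overlap is just the default multiplied by a constant, so it should leave the ordering unchanged; the square root narrows the gap between full and partial matches; squaring widens it. Which direction helps depends on how strongly the other scoring factors are pulling, so it is worth trying against real data.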