Tags: c#, lucene, lucene.net, fuzzy-search

Why does this Lucene.Net query fail?


I am trying to convert my search functionality to allow for fuzzy searches involving multiple words. My existing search code looks like:

        // Split the search into separate queries per word, and combine them into one major query
        var finalQuery = new BooleanQuery();

        string[] terms = searchString.Split(new[] { " " }, StringSplitOptions.RemoveEmptyEntries);
        foreach (string term in terms)
        {
            // Setup the fields to search
            string[] searchfields = new string[] 
            {
                // Various strings denoting the document fields available
            };

            var parser = new MultiFieldQueryParser(Lucene.Net.Util.Version.LUCENE_29, searchfields, new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_29));
            finalQuery.Add(parser.Parse(term), BooleanClause.Occur.MUST);
        }

        // Perform the search
        var directory = FSDirectory.Open(new DirectoryInfo(LuceneIndexBaseDirectory));
        var searcher = new IndexSearcher(directory, true);
        var hits = searcher.Search(finalQuery, MAX_RESULTS);

This works correctly: if I have an entity whose name field is "My name is Andrew" and I search for "Andrew Name", Lucene finds the right document. Now I want to enable fuzzy searching, so that "Anderw Name" also finds it. I changed my method to use the following code:

        const int MAX_RESULTS = 10000;
        const float MIN_SIMILARITY = 0.5f;
        const int PREFIX_LENGTH = 3;

        if (string.IsNullOrWhiteSpace(searchString))
            throw new ArgumentException("Provided search string is empty");

        // Split the search into separate queries per word, and combine them into one major query
        var finalQuery = new BooleanQuery();

        string[] terms = searchString.Split(new[] { " " }, StringSplitOptions.RemoveEmptyEntries);
        foreach (string term in terms)
        {
            // Setup the fields to search
            string[] searchfields = new string[] 
            {
                // Strings denoting document field names here
            };

            // Create a subquery where the term must match at least one of the fields
            var subquery = new BooleanQuery();
            foreach (string field in searchfields)
            {
                var queryTerm = new Term(field, term);
                var fuzzyQuery = new FuzzyQuery(queryTerm, MIN_SIMILARITY, PREFIX_LENGTH);
                subquery.Add(fuzzyQuery, BooleanClause.Occur.SHOULD);
            }

            // Add the subquery to the final query; every subquery (one per term) must match
            finalQuery.Add(subquery, BooleanClause.Occur.MUST);
        }

        // Perform the search
        var directory = FSDirectory.Open(new DirectoryInfo(LuceneIndexBaseDirectory));
        var searcher = new IndexSearcher(directory, true);
        var hits = searcher.Search(finalQuery, MAX_RESULTS);

Unfortunately, with this code, submitting the search query "Andrew Name" (same as before) returns zero results.

The core idea is that every term must be found in at least one field of the document, but different terms may match different fields. Does anyone have any idea why my rewritten query fails?


Final Edit: OK, it turns out I was overcomplicating this by a lot, and there was no need to change from my first approach. After reverting to the first code snippet, I enabled fuzzy searching by changing

        finalQuery.Add(parser.Parse(term), BooleanClause.Occur.MUST);

to

        finalQuery.Add(parser.Parse(term.Replace("~", "") + "~"), BooleanClause.Occur.MUST);
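To make the string transformation in that one-line change concrete, here is a minimal sketch with no Lucene dependency. `FuzzyTermSketch` and `ToFuzzyTerms` are hypothetical names, and skipping terms that are empty once "~" is removed is an added assumption, not part of the original change. As far as I know, the classic QueryParser lower-cases fuzzy terms by default, which would explain why this approach matches where a hand-built FuzzyQuery on the raw input does not.

```csharp
using System;
using System.Linq;

public static class FuzzyTermSketch
{
    // Hypothetical helper: turns a raw search string into the per-word
    // query texts that would be handed to parser.Parse(...).
    public static string[] ToFuzzyTerms(string searchString)
    {
        return searchString
            .Split(new[] { " " }, StringSplitOptions.RemoveEmptyEntries)
            // Strip any "~" the user typed so the operator is never doubled.
            .Select(term => term.Replace("~", ""))
            // Assumption: a word that consisted only of "~" is dropped,
            // since a bare "~" is not a meaningful fuzzy term.
            .Where(term => term.Length > 0)
            // Append the fuzzy operator understood by the query parser.
            .Select(term => term + "~")
            .ToArray();
    }
}
```

With this sketch, `FuzzyTermSketch.ToFuzzyTerms("Anderw Name")` produces the two strings `Anderw~` and `Name~`, each of which would be parsed and added to the final query with `BooleanClause.Occur.MUST`.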

Solution

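  • Your code works for me if I rewrite the searchString to lower-case. I'm assuming that you're using the StandardAnalyzer when indexing, which generates lower-case terms. A FuzzyQuery built directly from a Term is not analyzed, so "Andrew" is compared as-is against the lower-case indexed terms; with a prefix length of 3, the prefix "And" has to match exactly and never matches "and".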

    You need to do one of the following: 1) pass your tokens through the same analyzer (to get identical processing), 2) apply the same logic as the analyzer yourself, or 3) use an analyzer that matches the processing you do (for example, WhitespaceAnalyzer).
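Option 2 could look roughly like the sketch below. It assumes the index was built with StandardAnalyzer, so lower-casing approximates its processing; it does not reproduce the analyzer's stop-word removal or punctuation handling. `FuzzySearchSketch` and `BuildFuzzySubquery` are hypothetical names, and the Lucene.Net calls are the same 2.9-style ones already used in the question.

```csharp
using Lucene.Net.Index;
using Lucene.Net.Search;

public static class FuzzySearchSketch
{
    // Hypothetical helper: builds the per-term subquery from the question,
    // but lower-cases the term before creating the FuzzyQuery.
    public static BooleanQuery BuildFuzzySubquery(
        string term, string[] searchFields, float minSimilarity, int prefixLength)
    {
        // Assumption: the indexed terms are lower-case (StandardAnalyzer),
        // so the query term is normalized the same way by hand.
        string normalized = term.ToLowerInvariant();

        var subquery = new BooleanQuery();
        foreach (string field in searchFields)
        {
            // The term may match in any one of the fields (SHOULD).
            var fuzzyQuery = new FuzzyQuery(
                new Term(field, normalized), minSimilarity, prefixLength);
            subquery.Add(fuzzyQuery, BooleanClause.Occur.SHOULD);
        }
        return subquery;
    }
}
```

In the question's loop, each result of `BuildFuzzySubquery(term, searchfields, MIN_SIMILARITY, PREFIX_LENGTH)` would then be added to `finalQuery` with `BooleanClause.Occur.MUST`, exactly as before.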