Lucene.NET search fails if the term is too short


I am new to Lucene, so maybe this is a technical limit I don't understand.

I have indexed a few texts and am now trying to fetch the content. If I search the text "open-source reciprocal productivity" with the query "source", I get a match. If I use the query "sour", I also get a match. But if I use the query "sou", I don't get any match.

I am using Lucene.NET version 4.8. Here is the code I use to create the index:

using (var dir = FSDirectory.Open(targetDirectory))
{
    Analyzer analyzer = metadata.GetAnalyzer(); // returns new StandardAnalyzer(LuceneVersion.LUCENE_48)
    
    var indexConfig = new IndexWriterConfig(LuceneVersion.LUCENE_48, analyzer);

    using (IndexWriter writer = new IndexWriter(dir, indexConfig))
    {
        long entryNumber = csvRecords.Count();
        long index = 0;
        long lastPercentage = 0;
        foreach (dynamic csvEntry in csvRecords)
        {
            Document doc = new Document();
            IDictionary<string, object> dynamicCsvEntry = (IDictionary<string, object>)csvEntry;
            var indexedMetadataFiled = metadata.IdexedFields;

            foreach (string headField in header)
            {
                if (indexedMetadataFiled.ContainsKey(headField) == false || (indexedMetadataFiled[headField].NeedToBeIndexed == false && indexedMetadataFiled[headField].NeedToBeStored == false))
                    continue;

                var field = new Field(headField,
                                        ((string)dynamicCsvEntry[headField] ?? string.Empty).ToLower(),
                                        indexedMetadataFiled[headField].NeedToBeStored ? Field.Store.YES : Field.Store.NO, //YES
                                        indexedMetadataFiled[headField].NeedToBeIndexed ? Field.Index.ANALYZED : Field.Index.NO //YES
                                      );

                doc.Add(field);
            }

            long percentage = (long)(((decimal)index / (decimal)entryNumber) * 100m);
            if ( percentage > lastPercentage && percentage % 10 == 0)
            {
                _consoleLogger.Information($"..indexing {percentage}%..");
                lastPercentage = percentage;
            }


            writer.AddDocument(doc);
            index++;
        }

        writer.Commit();
    }
}

And here is the code I use to query the index:

var tokens = Regex.Split(query.Trim(), @"\W+");

BooleanQuery composedQuery = new BooleanQuery();
foreach (var field in luceneHint.FieldsToSearch)
{

    foreach (string word in tokens)
    {
        if (string.IsNullOrWhiteSpace(word))
            continue;

        var termQuery = new FuzzyQuery(new Term(field.FieldName, word.ToLower() ));
        termQuery.Boost = (float)field.Weight;
        composedQuery.Add(termQuery, Occur.SHOULD);
    }
}

var indexManager = IndexManager.Instance;
ReferenceManager<IndexSearcher> index = indexManager.Read(boundle);

int resultLimit = luceneHint?.Top ?? RESULT_LIMIT;
var results = new List<JObject>();
var searcher = index.Acquire();
try
{
    Dictionary<string, FieldDescriptor> filedToRead = (luceneHint?.FieldsToRead?.Any() ?? false) ?
                                                            luceneHint.FieldsToRead.ToDictionary(item => item.FieldName, item => item) :
                                                            new Dictionary<string, FieldDescriptor>();

    bool fetchEveryField = filedToRead.Count == 0;

    TopScoreDocCollector collector = TopScoreDocCollector.Create(resultLimit, true);
    int startPageIndex = pageIndex * itemsPerPage;

    searcher.Search(composedQuery, collector);

    //TopDocs topDocs = searcher.Search(composedQuery, luceneHint?.Top ?? 100);
    TopDocs topDocs = collector.GetTopDocs(startPageIndex, itemsPerPage);
    foreach (var scoreDoc in topDocs.ScoreDocs)
    {
        Document doc = searcher.Doc(scoreDoc.Doc);
        dynamic result = new JObject();
        foreach (var field in doc.Fields)
            if (fetchEveryField || filedToRead.ContainsKey(field.Name))
                result[field.Name] = field.GetStringValue();

        results.Add(result);
    }
}
finally
{
    if ( searcher != null )
        index.Release(searcher);
}

return results;

I am confused: is the fact that I can't get a result for the "sou" query related to the fact that the StandardAnalyzer used to build the index applies stop words that prevent my query term from being found in the index? (The index stops at "source" and "sour" because those are both English words.)

PS: here is the Explain output, even though I don't know how to use it:

searcher.Explain(composedQuery,6) {0 = (NON-MATCH) sum of: } Description: "sum of:" IsMatch: false Match: false Value: 0


Solution

  • The documentation for FuzzyQuery points out that it uses the default minimumSimilarity value of 0.5: https://lucenenet.apache.org/docs/3.0.3/d0/db9/class_lucene_1_1_net_1_1_search_1_1_fuzzy_query.html

    minimumSimilarity - a value between 0 and 1 to set the required similarity between the query term and the matching terms. For example, for a minimumSimilarity of 0.5 a term of the same length as the query term is considered similar to the query term if the edit distance between both terms is less than length(term) * 0.5

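    So, it matches "source" when the query is "sour": removing "ce" takes two edits, so the edit distance is 2, which is within the limit of length("sour") * 0.5 = 2. Matching "source" to "sou", however, would need 3 edits, so it is not a match. Note that the linked page describes version 3.0.3; in Lucene.NET 4.8, the version used in the question, FuzzyQuery expresses this limit as a maximum number of edits (maxEdits), which defaults to 2, so the conclusion is the same.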

    You should be able to see the same document matching even if you search for something like "bounce" or "sauce", since those are also within two edits of "source".
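
To check the edit counts above by hand, here is a minimal sketch of the classic Levenshtein distance in plain C#. It is not Lucene's implementation: the `Levenshtein` helper is a hypothetical name introduced for illustration, it counts only insertions, deletions and substitutions, and the threshold of 2 is hard-coded as an assumption about the effective limit for these terms.

```csharp
using System;

// Hypothetical helper (not part of Lucene.NET): classic dynamic-programming
// Levenshtein distance counting insertions, deletions and substitutions.
static int Levenshtein(string a, string b)
{
    // previous[j] = distance between the first i-1 chars of a and the first j chars of b
    var previous = new int[b.Length + 1];
    var current = new int[b.Length + 1];

    for (int j = 0; j <= b.Length; j++)
        previous[j] = j; // building b[0..j] from "" takes j insertions

    for (int i = 1; i <= a.Length; i++)
    {
        current[0] = i; // reducing a[0..i] to "" takes i deletions
        for (int j = 1; j <= b.Length; j++)
        {
            int substitution = previous[j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1);
            int deletion = previous[j] + 1;
            int insertion = current[j - 1] + 1;
            current[j] = Math.Min(substitution, Math.Min(deletion, insertion));
        }
        (previous, current) = (current, previous); // reuse the two rows
    }
    return previous[b.Length];
}

// Compare each query term from the question and answer against the indexed term.
foreach (var query in new[] { "source", "sour", "sou", "bounce", "sauce" })
{
    int distance = Levenshtein(query, "source");
    Console.WriteLine($"{query} -> source: {distance} edit(s), within 2: {distance <= 2}");
}
```

With these inputs, "sour", "bounce" and "sauce" each come out at 2 edits from "source", while "sou" comes out at 3, which is why only the "sou" query finds nothing.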