I am new to Lucene, so maybe this is a technical limitation I don't understand.
I have indexed some text and am trying to fetch the content.
If I search the text "open-source reciprocal productivity" with the query "source", I get a match.
If I use the query "sour", I also get a match. But if I use the query "sou", I don't get any match.
I am using Lucene.NET version 4.8. Here is the code I use to create the index:
using (var dir = FSDirectory.Open(targetDirectory))
{
    Analyzer analyzer = metadata.GetAnalyzer(); // returns new StandardAnalyzer(LuceneVersion.LUCENE_48)
    var indexConfig = new IndexWriterConfig(LuceneVersion.LUCENE_48, analyzer);
    using (IndexWriter writer = new IndexWriter(dir, indexConfig))
    {
        long entryNumber = csvRecords.Count();
        long index = 0;
        long lastPercentage = 0;
        foreach (dynamic csvEntry in csvRecords)
        {
            Document doc = new Document();
            IDictionary<string, object> dynamicCsvEntry = (IDictionary<string, object>)csvEntry;
            var indexedMetadataFiled = metadata.IdexedFields;
            foreach (string headField in header)
            {
                if (indexedMetadataFiled.ContainsKey(headField) == false || (indexedMetadataFiled[headField].NeedToBeIndexed == false && indexedMetadataFiled[headField].NeedToBeStored == false))
                    continue;
                var field = new Field(headField,
                    ((string)dynamicCsvEntry[headField] ?? string.Empty).ToLower(),
                    indexedMetadataFiled[headField].NeedToBeStored ? Field.Store.YES : Field.Store.NO, //YES
                    indexedMetadataFiled[headField].NeedToBeIndexed ? Field.Index.ANALYZED : Field.Index.NO //YES
                );
                doc.Add(field);
            }
            long percentage = (long)(((decimal)index / (decimal)entryNumber) * 100m);
            if (percentage > lastPercentage && percentage % 10 == 0)
            {
                _consoleLogger.Information($"..indexing {percentage}%..");
                lastPercentage = percentage;
            }
            writer.AddDocument(doc);
            index++;
        }
        writer.Commit();
    }
}
And here is the code I use to query the index:
var tokens = Regex.Split(query.Trim(), @"\W+");
BooleanQuery composedQuery = new BooleanQuery();
foreach (var field in luceneHint.FieldsToSearch)
{
    foreach (string word in tokens)
    {
        if (string.IsNullOrWhiteSpace(word))
            continue;
        var termQuery = new FuzzyQuery(new Term(field.FieldName, word.ToLower()));
        termQuery.Boost = (float)field.Weight;
        composedQuery.Add(termQuery, Occur.SHOULD);
    }
}
var indexManager = IndexManager.Instance;
ReferenceManager<IndexSearcher> index = indexManager.Read(boundle);
int resultLimit = luceneHint?.Top ?? RESULT_LIMIT;
var results = new List<JObject>();
var searcher = index.Acquire();
try
{
    Dictionary<string, FieldDescriptor> filedToRead = (luceneHint?.FieldsToRead?.Any() ?? false) ?
        luceneHint.FieldsToRead.ToDictionary(item => item.FieldName, item => item) :
        new Dictionary<string, FieldDescriptor>();
    bool fetchEveryField = filedToRead.Count == 0;
    TopScoreDocCollector collector = TopScoreDocCollector.Create(resultLimit, true);
    int startPageIndex = pageIndex * itemsPerPage;
    searcher.Search(composedQuery, collector);
    //TopDocs topDocs = searcher.Search(composedQuery, luceneHint?.Top ?? 100);
    TopDocs topDocs = collector.GetTopDocs(startPageIndex, itemsPerPage);
    foreach (var scoreDoc in topDocs.ScoreDocs)
    {
        Document doc = searcher.Doc(scoreDoc.Doc);
        dynamic result = new JObject();
        foreach (var field in doc.Fields)
            if (fetchEveryField || filedToRead.ContainsKey(field.Name))
                result[field.Name] = field.GetStringValue();
        results.Add(result);
    }
}
finally
{
    if (searcher != null)
        index.Release(searcher);
}
return results;
I am confused. Is the fact that I can't get results for the query "sou" related to the StandardAnalyzer used to build the index? Does it use some stop words that prevent my query term from being found in the index (so that matching stops at "source" and "sour" because both are English words)?
PS: here is the explain output, even though I don't know how to use it:
searcher.Explain(composedQuery,6) {0 = (NON-MATCH) sum of: } Description: "sum of:" IsMatch: false Match: false Value: 0
The documentation for FuzzyQuery points out that it uses the default minimumSimilarity value of 0.5: https://lucenenet.apache.org/docs/3.0.3/d0/db9/class_lucene_1_1_net_1_1_search_1_1_fuzzy_query.html
minimumSimilarity - a value between 0 and 1 to set the required similarity between the query term and the matching terms. For example, for a minimumSimilarity of 0.5 a term of the same length as the query term is considered similar to the query term if the edit distance between both terms is less than length(term) * 0.5
So it matches "source" when the query is "sour": removing "ce" requires two edits, so the edit distance is 2, which is <= length("sour") * 0.5. However, matching "source" to "sou" would need 3 edits, so it is not a match. Note that the linked page documents version 3.0.3; in Lucene.NET 4.8 the single-argument FuzzyQuery constructor allows at most 2 edits by default, which leads to the same result.
You should be able to see the same document matching even if you search for something like "bounce" or "sauce", since those are also within two edits of "source".
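To check the edit counts above without touching the index, here is a minimal sketch that computes a plain Levenshtein distance between each query and the indexed term "source". The `EditDistance` helper and the `maxEdits` constant are my own illustrative names, not part of Lucene.NET, and the sketch assumes a limit of 2 edits. Lucene's own matching also treats a transposition of two adjacent characters as a single edit by default, which this simple version does not, but none of the terms here involve a transposition.

```csharp
using System;

// Classic dynamic-programming Levenshtein distance (insert, delete, substitute).
// Hypothetical helper for illustration only; it is not a Lucene.NET API.
static int EditDistance(string a, string b)
{
    // previous[j] = distance between the first i-1 chars of a and the first j chars of b.
    var previous = new int[b.Length + 1];
    var current = new int[b.Length + 1];
    for (int j = 0; j <= b.Length; j++)
        previous[j] = j;

    for (int i = 1; i <= a.Length; i++)
    {
        current[0] = i;
        for (int j = 1; j <= b.Length; j++)
        {
            int substitution = previous[j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1);
            int deletion = previous[j] + 1;
            int insertion = current[j - 1] + 1;
            current[j] = Math.Min(substitution, Math.Min(deletion, insertion));
        }
        // The row just filled becomes the "previous" row for the next character of a.
        (previous, current) = (current, previous);
    }
    return previous[b.Length];
}

// Assumed edit limit, mirroring the two-edit threshold discussed above.
const int maxEdits = 2;

foreach (var query in new[] { "source", "sour", "sou", "sauce", "bounce" })
{
    int distance = EditDistance(query, "source");
    Console.WriteLine($"{query} -> source: {distance} edit(s), within limit: {distance <= maxEdits}");
}
```

Under these assumptions, "sour", "sauce" and "bounce" are each 2 edits away from "source" and stay within the limit, while "sou" is 3 edits away and falls outside it.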