I have a text of the Word document and an array of the strings. The goal is to find all occurrences for those strings in the document's text. I tried to use Aho-Corasick string matching in C# implementation of the Aho-Corasick algorithm but the default implementation doesn't fit for me. The typical part of the text looks like
“Activation” means a written notice from Lender to the Bank substantially in the form of Exhibit A.
“Activation Notice” means a written notice from Lender to the Bank substantially in the form of Exhibit A and Activation.
“Business Day" means each day (except Saturdays and Sundays) on which banks are open for general business and Activation Notice.
The array of the keywords looks like
var keywords = new[] {"Activation", "Activation Notice"};
The default implementation of the Aho-Corasick algorithm returns the following count of the occurrences
Activation - 4
Activation Notice - 2
For 'Activation Notes' it's the correct result. But for 'Activation' the correct count should be also 2 because I do not need to consider occurrences inside the adjacent keyword 'Activation Notice'.
Is there a proper algorithm for this case?
I will assume you got your results according to the example you linked.
StringSearchResult[] results = searchAlg.FindAll(textToSearch);
With those results
, if you assume that the only overlaps are subsets, you can sort by index and collect your desired results in a single pass.
public class SearchResultComparer : IComparer<StringSearchResult> {
public int StringSearchResult(StringSearchResult x, StringSearchResult y)
{
// Try ordering by the start index.
int compare = x.Index.CompareTo(y.Index);
if (compare == 0)
{
// In case of ties, reverse order by keyword length.
compare = y.Keyword.Length.CompareTo(x.Keyword.Length);
}
return compare;
}
}
// ...
IComparer searchResultComparer = new SearchResultComparer();
Array.Sort(results, searchResultComparer);
int activeEndIndex = -1;
List<StringSearchResult> nonOverlappingResults = new List<StringSearchResult>();
foreach(StringSearchResult r in results)
{
if (r.Index < activeEndIndex)
{
// This range starts before the active range ends.
// Since it's an overlap, skip it.
continue;
}
// Save this result, track when it ends.
nonOverlappingResults.Add(r);
activeEndIndex = r.Index + r.Keyword.Length;
}
Due to the index sorting, the loop guarantees that only non-overlapping ranges will be kept. But some ranges will be rejected. This can only happen for two reasons.
Therefore the only rejected candidates will be subsets, and must end before the active range. So the active range remains the only thing to worry about overlapping with.