Search code examples
c#algorithmstring-searchaho-corasick

Find occurrences of the adjacent sub strings in the text


I have a text of the Word document and an array of the strings. The goal is to find all occurrences for those strings in the document's text. I tried to use Aho-Corasick string matching in C# implementation of the Aho-Corasick algorithm but the default implementation doesn't fit for me. The typical part of the text looks like

Activation” means a written notice from Lender to the Bank substantially in the form of Exhibit A.

Activation Notice” means a written notice from Lender to the Bank substantially in the form of Exhibit A and Activation.

Business Day" means each day (except Saturdays and Sundays) on which banks are open for general business and Activation Notice.

The array of the keywords looks like

var keywords = new[] {"Activation", "Activation Notice"};

The default implementation of the Aho-Corasick algorithm returns the following count of the occurrences

Activation - 4

Activation Notice - 2

For 'Activation Notes' it's the correct result. But for 'Activation' the correct count should be also 2 because I do not need to consider occurrences inside the adjacent keyword 'Activation Notice'.

Is there a proper algorithm for this case?


Solution

  • I will assume you got your results according to the example you linked.

    StringSearchResult[] results = searchAlg.FindAll(textToSearch);
    

    With those results, if you assume that the only overlaps are subsets, you can sort by index and collect your desired results in a single pass.

    public class SearchResultComparer : IComparer<StringSearchResult> { 
        public int StringSearchResult(StringSearchResult x, StringSearchResult y) 
        { 
            // Try ordering by the start index.
            int compare = x.Index.CompareTo(y.Index);
            if (compare == 0)
            {
                // In case of ties, reverse order by keyword length.
                compare = y.Keyword.Length.CompareTo(x.Keyword.Length);
            }
            return compare;
        } 
    } 
    
    // ...
    
    
    IComparer searchResultComparer = new SearchResultComparer();
    Array.Sort(results, searchResultComparer); 
    
    int activeEndIndex = -1;
    List<StringSearchResult> nonOverlappingResults = new List<StringSearchResult>();
    foreach(StringSearchResult r in results)
    {
        if (r.Index < activeEndIndex)
        {
            // This range starts before the active range ends.
            // Since it's an overlap, skip it.
            continue;
        }
    
        // Save this result, track when it ends.
        nonOverlappingResults.Add(r);
        activeEndIndex = r.Index + r.Keyword.Length;
    }
    

    Due to the index sorting, the loop guarantees that only non-overlapping ranges will be kept. But some ranges will be rejected. This can only happen for two reasons.

    1. The candidate starts at the same index as the active range. Since the sorting breaks these ties so longest goes first, the candidate must be shorter than the active range and can be skipped.
    2. The candidate starts after the active range. Since the only overlaps are subsets, and this overlaps with the active range, it is a subset that starts later but still ends at or before.

    Therefore the only rejected candidates will be subsets, and must end before the active range. So the active range remains the only thing to worry about overlapping with.