Search code examples
c++algorithmstring-search

String Finding Alg w/ Lowest Freq Char


I have 3 text files. One with a set of text to be searched through
(ex. ABCDEAABBCCDDAABC)
One contains a number of patterns to search for in the text
(ex. AB, EA, CC)
And the last containing the frequency of each character
(ex.
A 4
B 4
C 4
D 3
E 1
)
I am trying to write an algorithm to find the least frequent occurring character for each pattern and search a string for those occurrences, then check the surrounding letters to see if the string is a match. Currently, I have the characters and frequencies in their own vectors, respectively. (Where i=0 for each vector would be A 4, respectively.

Is there a better way to do this? Maybe a faster data structure? Also, what are some efficient ways to check the pattern string against the piece of the text string once the least frequent letter is found?


Solution

  • You can run the Aho-Corasick algorithm. Its complexity (once the preprocessing - whose complexity is unrelated to the text - is done), is Θ(n + p), where

    • n is the length of the text

    • p is the total number of matches found

    This is essentially optimal. There is no point in trying to skip over letters that appear to be frequent:

    1. If the letter is not part of a match, the algorithm takes unit time.

    2. If the letter is part of a match, then the match includes all letters, irrespective of their frequency in the text.