java, algorithm, full-text-search, string-matching, aho-corasick

Aho-Corasick text matching on whole words?


I'm using Aho-Corasick text matching and wonder whether it could be altered to match terms instead of characters. In other words, I want whole terms, rather than individual characters, to be the basis of matching. As an example:

Search query: "He"

Sentence: "Hello world"

Aho-Corasick will match "He" against the sentence "Hello world", with the match ending at index 2, but I would prefer to get no match at all. By "terms" I mean whole words rather than characters.


Solution

  • One way to do this would be to use Aho-Corasick as usual, then add a filtering step that eliminates the false positives. For example, every time you find a match, you can confirm that the characters immediately before and after it in the input are non-letter characters like spaces or punctuation, or that the match touches the start or end of the input. That way, you keep the speed of the Aho-Corasick lookup, but only consider matches that appear as whole words in the text.

    Hope this helps!
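
The filtering step described above could be sketched in Java roughly as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the class `WholeWordAhoCorasick`, its `Match` type, and the `isBoundary` helper are hypothetical names; matching is made case-insensitive so that "He" and "he" are treated alike; and digits are counted as word characters alongside letters. None of these choices come from the question itself.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

// Hypothetical sketch: a small Aho-Corasick automaton plus a whole-word filter.
final class WholeWordAhoCorasick {

    /** A reported hit: keyword found at text[start, end). */
    static final class Match {
        final int start, end;
        final String keyword;
        Match(int start, int end, String keyword) {
            this.start = start; this.end = end; this.keyword = keyword;
        }
    }

    /** Trie node with outgoing edges, failure link, and keywords ending here. */
    private static final class Node {
        final Map<Character, Node> next = new HashMap<>();
        final List<String> outputs = new ArrayList<>();
        Node fail;
    }

    private final Node root = new Node();

    WholeWordAhoCorasick(List<String> keywords) {
        // Insert each keyword into the trie (lower-cased: an assumption).
        for (String keyword : keywords) {
            if (keyword.isEmpty()) continue;
            Node node = root;
            for (char c : keyword.toCharArray()) {
                node = node.next.computeIfAbsent(Character.toLowerCase(c), k -> new Node());
            }
            node.outputs.add(keyword);
        }
        // Breadth-first pass: set failure links and inherit outputs from them.
        Queue<Node> queue = new ArrayDeque<>();
        for (Node child : root.next.values()) {
            child.fail = root;
            queue.add(child);
        }
        while (!queue.isEmpty()) {
            Node node = queue.remove();
            for (Map.Entry<Character, Node> edge : node.next.entrySet()) {
                Node child = edge.getValue();
                Node f = node.fail;
                while (f != root && !f.next.containsKey(edge.getKey())) f = f.fail;
                child.fail = f.next.getOrDefault(edge.getKey(), root);
                child.outputs.addAll(child.fail.outputs);
                queue.add(child);
            }
        }
    }

    /** Runs the usual scan, keeping only hits bounded by non-word characters. */
    List<Match> findWholeWords(String text) {
        List<Match> matches = new ArrayList<>();
        Node node = root;
        for (int i = 0; i < text.length(); i++) {
            char c = Character.toLowerCase(text.charAt(i));
            while (node != root && !node.next.containsKey(c)) node = node.fail;
            node = node.next.getOrDefault(c, root);
            for (String keyword : node.outputs) {
                int end = i + 1;
                int start = end - keyword.length();
                // Filtering step: reject hits that sit inside a longer word.
                if (isBoundary(text, start - 1) && isBoundary(text, end)) {
                    matches.add(new Match(start, end, keyword));
                }
            }
        }
        return matches;
    }

    /** True if index is outside the text or holds a non-letter, non-digit char. */
    private static boolean isBoundary(String text, int index) {
        return index < 0 || index >= text.length()
                || !Character.isLetterOrDigit(text.charAt(index));
    }

    public static void main(String[] args) {
        WholeWordAhoCorasick matcher = new WholeWordAhoCorasick(List.of("he", "world"));
        for (Match m : matcher.findWholeWords("He said hello, world!")) {
            System.out.println(m.keyword + " @ [" + m.start + ", " + m.end + ")");
        }
        // prints:
        // he @ [0, 2)
        // world @ [15, 20)
    }
}
```

With the query "he" alone, `findWholeWords("Hello world")` returns an empty list, which is the behaviour asked for in the question. The filter only looks at the two characters around each hit, so it adds constant work per match; what counts as a word character (digits, underscores, apostrophes) is a choice you would adapt to your own text.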