Search code examples
dictionarypattern-matchingstring-searchaho-corasick

Aho-corasick search for keyword pairs


Suppose we have a dictionary of keywords

Dictionary A: {A1, A2, A3}

And suppose we have a second dictionary of keywords (distinct from the first)

Dictionary B: {B1, B2, B3, B4}

I'd like to find all possible matches of unordered pairs of keywords in a sequence (i.e., separated only by whitespace) from both dictionaries in an input text. For example, consider the following as an input text

We are not looking for single words from either dictionary on their own, like 
A2 or B4, nor are we looking for sequences of words from only one dictionary, 
like A1 A3 or B4 B2. We are looking for tuples of words from both dictionaries
in a sequence together, like B1 A3 and A2 B4 and B4 A2.

The Aho-Corasick algorithm is a traditional solution for efficiently finding all matches from a single dictionary of keywords in an input text by constructing a trie-like automaton and scanning the text character-by-character.

Is there an efficient way to expand Aho-Corasick for the case of multiple dictionaries?


Solution

  • Yes, you could build a general aho-corasick automaton and an individual for each document:Using Aho-Corasick, can strings be added after the initial tree is built?