Search code examples
algorithmpattern-matchingaho-corasick

State transition table for aho corasick algorithm


Please help me to understand the construction of state transition table for more than one patterns in the Aho-Corasick algorithm.

Please give a simple and detailed explanation so that I could understand.

I am following this paper and here is the animation for that.

Thanks.


Solution

  • Phase 1

    Creating the keyword tree :

    Starting at the root, follow the path labeled by chars of Pi
            If the path ends before Pi, continue it by adding new edges and ...
                                nodes for the remaining characters of P
            Store identifier i of Pi at the terminal node of the path

    I demonstrate it by an example :

    Suppose we have a finite set of patterns P = {he, she, his, hers}.

    enter image description here

    Phase2

    Next we extend the keyword tree into an automaton,to support linear-time matching as follows.

    Obviously the automaton consists of two parts :

    • states
    • transitions

    States : Exactly the nodes achieved by the keyword tree; and initial state = root of the tree.

    enter image description here

    Transitions : using goto(q,a) function as follows.

    /*This function gives the state entered from current state q 
         by matching target char a*/
    - If edge (q,v) in the tree is labeled by a, then g(q,a)=v; 
    - g(0,a)=0 for each a that does not label an edge out of the root
    //the automton stays at the initial state while scanning non-matching characters       
    - O.W. g(q,a)={}

    enter image description here

    Using the automaton :

    SearchofTarget T[1..m] //considering the target string T as an 
                         array of chars indexed from 1   to m
    q:= 0; // initial state(root)
    for i:= 1 to m do
        while g(q,T[i]) == {}  do 
            q:= fail(q); // follow a fail
        q:= goto(q,T[i]); // follow a goto
        if output(q) != {} then print i, out(q);
    endfor;
    

    As you see in the above pseudo-code it calls two functions : fail(q) and output(q)

    fail(q) : // for q !=0 gives the state entered at a mismatch

    failure(q) is the node labeled by the longest proper suffix  w of L(q) ...
        s.t. w is a prefix of some pattern
    //L(q) is the label of node q as the concatenation of edge labels ...
       on the path from the root to q
    

    output(q) :

    gives the set of patterns recognized when entering state q
    

    After computation of these two functions, the automaton looks like this :

    enter image description here

    Now you may wonder when to compute these methods, so please look at these for more official form :

    enter image description here

    Hope this help and if still something is ambiguous please don't hesitate to ask.