Search code examples
transitiondfaformal-languagesnfa

Converting regex to a NFA transistion table


Doesn't matter the language, but I need to figure out how to convert a regex to a NFA table. For example "(ab)* + ba" turns into
T | a | b | ^
0 | N | 1 | 2
1 | 3 | N | N
2 | 4 | N | 3
3 | N | N | N
4 | N | 2 | N

If anyone could help point my in the right direction or show me how this could be done that would be much appreciated.

Edit: I took a look at:
http://www.cs.may.ie/staff/jpower/Courses/Previous/parsing/node5.html, but I was still unable to get an idea of how to program this


Solution

  • First, find the outermost operation. In your example, it is +. When you have a +, it means that you are able to accept either the thing on the left or the thing on the right. We can encode this in an NFA using empty (lambda, or epsilon) transitions as follows:

    Q    s    Q'
    q0   -    M1
    q0   -    M2
    

    We have q0 as our starting point and we use M1 and M2 to represent machines which accept strings generated by the LHS and RHS, respectively, of our regular expression. When we say q0 transitions to M1 and M2 on lambda/epsilon - empty transitions - we mean that we nondeterministically choose which path to go down. The transition will be from q0 to the initial states of M1 and M2, whatever those states happen to be.

    Now we repeat the process recursively on each of the LHS and RHS. We can begin with the RHS since it is simpler. The outermost operation here is concatenation (of the symbols a and b). Concatenation is simple to represent:

    Q    s    Q'
    q2   -    M3
    M3   -    M4
    

    Here, q2 is the initial state of M2 from before, and M3 and M4 represent as-of-yet undetermined machines which accept the LHS and RHS, respectively, of the concatenation of a and b. When we say q2 transitions to M3, we mean it transitions to M3's initial state; and when we say M3 transitions to M4, we mean all accepting states of M3 transition to the initial state of M4.

    Proceeding recursively, we now need machines for a and b. These both have the form:

    Q    s    Q'
    q    x    q'
    

    Where q is the initial state, x is the symbol and q' is an accepting state. So we get:

    Q    s    Q'
    q3   b    q4   (q3 initial, q4 accepting)
    

    and

    Q    s    Q'
    q5   a    q6   (q5 initial, q6 accepting)
    

    We have hit the bottom of this recursion branch and can step back up, generating concrete entries in the transition table based on the concrete machines we have defined. We had this:

    Q    s    Q'
    q2   -    M3
    M3   -    M4
    

    And now we know what M3 and M4 look like, so we can substitute:

    Q    s    Q'
    q2   -    q3
    q3   b    q4
    q4   -    q5
    q5   a    q6    (q2 initial, q6 accepting)
    

    Now we are ready to do the LHS from the + operation. The outermost operation is *. The way we handle these in NFAs is as follows:

    Q    s    Q'
    q7   -    M5
    M5   -    M5
    

    We now consider the next operation, concatenation. We have already covered this and we know we get this:

    Q    s    Q'
    q8   -    M6
    M6   -    M7
    

    Now we need a and b. Again, we know these look like:

    Q    s    Q'
    q9   a    q10
    

    and

    Q    s    Q'
    q11  b    q12
    

    We put it all back together:

    Q    s    Q'
    q8   -    q9
    q9   a    q10
    q10  -    q11
    q11  b    q12    (q8 initial, q12 accepting)
    

    Then we do the Kleene star:

    Q    s    Q'
    q7   -    q8
    q8   -    q9
    q9   a    q10
    q10  -    q11
    q11  b    q12
    q12  -    q8    (q8 initial, q8 and q12 accepting)
    

    Finally, we combine all the rules in one big transition table:

    Q    s    Q'
    
    q0   -    q2
    q0   -    q7
    
    q2   -    q3
    q3   b    q4
    q4   -    q5
    q5   a    q6
    
    q7   -    q8
    q8   -    q9
    q9   a    q10
    q10  -    q11
    q11  b    q12
    q12  -    q8    (q0 initial, q6, q8 and q12 accepting)
    

    Thus you can recursively construct an NFA for any regular expression. The resulting NFA will have some unnecessary states in the general case but NFA optimization is a delicate topic. You can always take this (or any) NFA, convert to a DFA using a known algorithm and then minimize using a known algorithm. Then you have a provably minimal DFA, though it might be much bigger than even this padded NFA!