Doesn't matter the language, but I need to figure out how to convert a regex to a NFA table.
For example "(ab)* + ba" turns into
T | a | b | ^
0 | N | 1 | 2
1 | 3 | N | N
2 | 4 | N | 3
3 | N | N | N
4 | N | 2 | N
If anyone could help point my in the right direction or show me how this could be done that would be much appreciated.
Edit: I took a look at:
http://www.cs.may.ie/staff/jpower/Courses/Previous/parsing/node5.html, but I was still unable to get an idea of how to program this
First, find the outermost operation. In your example, it is +
. When you have a +
, it means that you are able to accept either the thing on the left or the thing on the right. We can encode this in an NFA using empty (lambda, or epsilon) transitions as follows:
Q s Q'
q0 - M1
q0 - M2
We have q0
as our starting point and we use M1
and M2
to represent machines which accept strings generated by the LHS and RHS, respectively, of our regular expression. When we say q0
transitions to M1
and M2
on lambda/epsilon - empty transitions - we mean that we nondeterministically choose which path to go down. The transition will be from q0
to the initial states of M1
and M2
, whatever those states happen to be.
Now we repeat the process recursively on each of the LHS and RHS. We can begin with the RHS since it is simpler. The outermost operation here is concatenation (of the symbols a
and b
). Concatenation is simple to represent:
Q s Q'
q2 - M3
M3 - M4
Here, q2
is the initial state of M2
from before, and M3
and M4
represent as-of-yet undetermined machines which accept the LHS and RHS, respectively, of the concatenation of a
and b
. When we say q2
transitions to M3
, we mean it transitions to M3
's initial state; and when we say M3
transitions to M4
, we mean all accepting states of M3
transition to the initial state of M4
.
Proceeding recursively, we now need machines for a
and b
. These both have the form:
Q s Q'
q x q'
Where q
is the initial state, x
is the symbol and q'
is an accepting state. So we get:
Q s Q'
q3 b q4 (q3 initial, q4 accepting)
and
Q s Q'
q5 a q6 (q5 initial, q6 accepting)
We have hit the bottom of this recursion branch and can step back up, generating concrete entries in the transition table based on the concrete machines we have defined. We had this:
Q s Q'
q2 - M3
M3 - M4
And now we know what M3
and M4
look like, so we can substitute:
Q s Q'
q2 - q3
q3 b q4
q4 - q5
q5 a q6 (q2 initial, q6 accepting)
Now we are ready to do the LHS from the +
operation. The outermost operation is *
. The way we handle these in NFAs is as follows:
Q s Q'
q7 - M5
M5 - M5
We now consider the next operation, concatenation. We have already covered this and we know we get this:
Q s Q'
q8 - M6
M6 - M7
Now we need a
and b
. Again, we know these look like:
Q s Q'
q9 a q10
and
Q s Q'
q11 b q12
We put it all back together:
Q s Q'
q8 - q9
q9 a q10
q10 - q11
q11 b q12 (q8 initial, q12 accepting)
Then we do the Kleene star:
Q s Q'
q7 - q8
q8 - q9
q9 a q10
q10 - q11
q11 b q12
q12 - q8 (q8 initial, q8 and q12 accepting)
Finally, we combine all the rules in one big transition table:
Q s Q'
q0 - q2
q0 - q7
q2 - q3
q3 b q4
q4 - q5
q5 a q6
q7 - q8
q8 - q9
q9 a q10
q10 - q11
q11 b q12
q12 - q8 (q0 initial, q6, q8 and q12 accepting)
Thus you can recursively construct an NFA for any regular expression. The resulting NFA will have some unnecessary states in the general case but NFA optimization is a delicate topic. You can always take this (or any) NFA, convert to a DFA using a known algorithm and then minimize using a known algorithm. Then you have a provably minimal DFA, though it might be much bigger than even this padded NFA!