Search code examples
regexdfa

Minimal DFA for whole-words in regex


In creating a DFA for a regex, I have noticed that whole-words add to the number of states, even though analytically, they look similar to regex with fewer states.

For example, to me, (a|b)+ looks the same as (hello|world)+

If I had a matching string, it would be fairly easily to find/replace "hello" with "a" and "world" with "b" and viceversa. So my question is, why aren't "hello" and "world" counted as single states?


Solution

  • Because DFAs are mind-numbingly simple to implement with the simpler definition of states, at the cost of having more states. What you suggest is fine for describing how you want a DFA to work, and have a straight-forward correspondence to the traditional DFAs. But it doesn't allow you to say anything more.

    It is similar to the use of NFAs: they're easier to design and (maybe) think about, but have no more power, and there is a well-defined algorithm to translate them into DFAs (again, at the cost of introducing states).

    Think of DFAs using single-character transitions as the "machine language" of regular expressions (which are NOT the same thing as regex, to get pedantic).