Search code examples
regexstate-machinefsm

Short example of regular expression converted to a state machine?


In the Stack Overflow podcast #36 (https://blog.stackoverflow.com/2009/01/podcast-36/), this opinion was expressed: Once you understand how easy it is to set up a state machine, you’ll never try to use a regular expression inappropriately ever again.

I've done a bunch of searching. I've found some academic papers and other complicated examples, but I'd like to find a simple example that would help me understand this process. I use a lot of regular expressions, and I'd like to make sure I never use one "inappropriately" ever again.


Solution

  • Sure, although you'll need more complicated examples to truly understand how REs work. Consider the following RE:

    ^[A-Za-z][A-Za-z0-9_]*$
    

    which is a typical identifier (must start with alpha and can have any number of alphanumeric and undescore characters following, including none). The following pseudo-code shows how this can be done with a finite state machine:

    state = FIRSTCHAR
    for char in all_chars_in(string):
        if state == FIRSTCHAR:
                if char is not in the set "A-Z" or "a-z":
                    error "Invalid first character"
                state = SUBSEQUENTCHARS
                next char
        if state == SUBSEQUENTCHARS:
                if char is not in the set "A-Z" or "a-z" or "0-9" or "_":
                    error "Invalid subsequent character"
                state = SUBSEQUENTCHARS
                next char
    

    Now, as I said, this is a very simple example. It doesn't show how to do greedy/nongreedy matches, backtracking, matching within a line (instead of the whole line) and other more esoteric features of state machines that are easily handled by the RE syntax.

    That's why REs are so powerful. The actual finite state machine code required to do what a one-liner RE can do is usually very long and complex.

    The best thing you could do is grab a copy of some lex/yacc (or equivalent) code for a specific simple language and see the code it generates. It's not pretty (it doesn't have to be since it's not supposed to be read by humans, they're supposed to be looking at the lex/yacc code) but it may give you a better idea as to how they work.