Search code examples
compilationexpressionregular-languageformal-languages

Regular Expression Format Confusion


I'm trying to get my head around some regular expression to later program a compiler.

if I have the expression:

(a or b)*

Is this the same as a* or b*? Or does it mean that you can choose a or b zero or more times.

For example, using this regular expression, can I generate {ababababa} or only strings of {aaaaaaa} or {bbbbbbb}? If the input symbol is a b then does that mean only the b can occur zero or more times or can a occur the second time too?

Thanks very much


Solution

  • In most regex libraries, the or operator is spelled |, so your regex would be (a|b)*.

    That indeed means "any string of any length (including 0) made up only of as and bs". In other words, the parentheses work just as in any algebraic expression, to define a subexpression: the * (postfix) operator is applied to the subexpression a|b.

    Interesting fact: (a*b*)* is exactly the same set of strings as (a|b)*.