Search code examples
regexcompiler-constructioncomputation-theory

What are the type of Strings generated by (a*+b*)


Besides strings of any number of a's and b's like aa.. or bb.. ,Would regular expression (a*+b*) contain a string like

ab

or any string ending with b ?

Is (a*+b*) same as (a* b*) ?

I am a little bit confuse about the strings generated by regular expression (a*+b*) and would really appreciate if someone can help.


Solution

  • Unless you're working with a regex language which explicitly classifies the *+ as a special token which either has a special meaning, or is reserved for future extension (and produces defined behavior now, or a syntax error), the natural parse of a*+ is that it means (a*)+: the postfix + is applied to the expression a*.

    If that interpretation applies, next we can observe that (a*)+ is equivalent to just a*. Therefore a*+b* is the same as a*b*.

    Firstly, by definition R+ means RR*. Match one R and then zero or more of them. Therefore, we can rewrite (a*)+ as (a*)(a*)*.

    Secondly, * is idempotent, so (a*)* is is just (a*). If we match "zero or more a", zero or more times, nothing changes; the net effect is zero or more a. Proof: R* denotes this infinite expansion: (|R|RR|RRR|RRRR|RRRRR|...): match nothing, or match one R, or match two R's, ... Therefore, (a*)* dentes this expansion: (|a*|a*a*|a*a*a*|...). These inner a*-s in turn denote individual second-level expansions: (|(|a|aa|aaa|...|)|(|a|aa|aaa|...)(a|a|aaa|...))|...). By the associative property of the branch |, we can flatten a structure like (a|(b|c)) into (a|b|c), and when we do this to the expansion, we note that there are numerous identical terms—the empty regex (), the single a, the double aa and so on. These all reduce to a single copy, because (|||) is equivalent to () and (a|a|a|a|...) is equivalent to just (a) and so on. That is to say, when we sort the terms by increasing length, and squash multiple identical terms to just one copy, we end up with (|a|aa|aaa|aaaa|...), which is recognizable as the expansion of just a*. Thus (a*)* is a*.

    Lastly, (a*)(a*) just means a*. Proof: Similarly to the previous, we expand into branches: (|a|aa|aaa|...)(|a|aa|aaa|...). Next we note that the catenation of branch expressions is equivalent to a Cartesian Product set of the terms. That is to say (a|b|c|..)(i|j|k|...) means, precisely: (ai|aj|ik|...|bi|bj|bk|...|ci|cj|ck|...|...). When we apply this product to (|a|aa|aaa|...)(|a|aa|aaa|...) we get a proliferation of terms, which, when arranged in increasing length and subject to de-duplication, reduce to (|a|aa|aaa|aaaa|...), and that is just a*.