Search code examples
regexcontext-free-grammarregular-languageformal-languages

How do I find the language from a regular expression?


How would I find the language for the following regular expressions over the alphabet {a, b}?

aUb*
(ab*Uc)
ab*Ubc*
a*bc*Uac

EDIT: Before i get downvoted like crazy, i'd appreciate it if someone could show me the steps towards solving these problems, not just the solution. Maybe even walking me through one so I can do the rest on my own.

Thanks!


Solution

  • Edit: short answer, * means "zero or more repetitions" in almost all regex/grammar syntaxes include perl5 and RFC 5234. * typically binds more tightly than concatenation and alternation.

    You say you want a language over the alphabet (a, b), but include c and U in your expressions. I'm going to assume that you want a language grammar over the alphabet (a, b, c) in a form like BNF given a regular expression where U is a low-precedence union operator, similar to | in perl re.

    In that case,

    a|b*
    

    is equivalent to the BNF:

    L := a
       | <M>
    M := epsilon
       | b <M>
    

    The * operator means zero or more, so the first clause in M is the base case, and the second clause is a recursive use of M that includes a terminal b.

    L is just either a single terminal a or the nonterminal M.

    (ab*|c)
    
    L ::= a <M>
        | c
    M ::= epsilon
        | b <M>
    

    M is derived the same way as above, and the rest is self explanatory.

    ab*|bc*
    
    L ::= a <M>
        | b <N>
    M ::= epsilon
        | b <M>
    N ::= epsilon
        | c <N>
    

    N is derived in the same way as M above.

    a*bc*|ac
    

    The * in most regular expression languages binds more tightly than concatenation, so this is the same as

    (a*b(c*))|(ac)
    

    which boils down to

    L ::= <M> b <N>
        | a c
    M ::= epsilon
        | a <M>
    N ::= epsilon
        | c <N>
    

    In general to convert a regex to BNF, you simply use adjacency in a regex to mean adjaceny in BNF, and U or | in a regex means | in BNF.

    If you define a nonterminal <X> ::= x then you can handle x* thus:

    L ::= epsilon
        | <X> <L>
    

    With the same nonterminal <X> ::= x then you can handle x+ thus:

    L ::= <X>
        | <L> <X>
    

    That gets you the repetition operators * and + which leaves ?. x? is simply

    L ::= epsilon
        | <X>