Search code examples
regular-language

Given Regular Language, find Regular Expression


The language is as follows.

∑={a,b,c}

L = the second and third to last characters of ω are the same, ω has length greater than 5 and ω contains ccc.

I tried doing it and I'm not sure if it's right. Got the following:

((ccc)(aUbUc)*(a)(a)(aUbUC)) U ((ccc)(aUbUc)*(b)(b)(aUbUC)) U ((ccc)(aUbUc)*(c)(c)(aUbUC))

Is this correct?


Solution

  • Nope, that's not correct. For example, your expression does not recognize the string aaccc, because all of your subexpressions require that your strings start with ccc, which is different from what the language description would indicate.

    Some of your expression is right, such as needing to split out the aa, bb and cc parts. You are overusing parentheses a bit, but that's just a matter of taste rather than correctness.

    Your basic unit is (aUbUc), representing . A string must contain ccc somewhere in it, so let's start with just that:

    (aUbUc)*ccc(aUbUc)*
    

    This is all strings containing ccc. The next requirement is complicated: the third-from-last and second-from-last characters have to be the same. This could overlap with the ccc part. If it didn't, this would be sufficient:

    (aUbUc)*ccc(aUbUc)*(aaUbbUcc)(aUbUc)
    

    But this doesn't allow us to have a string like accca or aaccc. Note, however, that it does require all strings to be at least length 6, so it satisfies the requirement that strings be of length greater than 5. I'm going to abbreviate and use instead of (aUbUc) to make this smaller:

    (∑*ccc∑*(aaUbbUcc)∑)U(∑∑∑∑*ccc)U(∑∑∑*ccc∑)
    

    Note the extra s, which are required to pad out the other sub-expressions to make sure all paths have string length greater than 5.

    An alternative to coming up with a regular expression directly is to build a DFA that matches this language, and then convert it to a regular expression. You'll find similar concerns when building your DFA, such as needing to make sure that you cover the overlap case, where ccc is near the end of the string. To make that a bit easier, you can start with an NFA and then convert your NFA into a DFA.