Search code examples
regexcomputer-scienceregular-language

Simplification of REGEX expression


I need to prove or disprove the below REGEX

(RS + R )* R = R (SR + R)*
// or, for programmers:
/(RS|R)*R/ == /R(SR|R)*/

I have a strong gut feel that they are equivalent, but how do i give a step by step proof using the laws of REGEX.


Solution

  • First understand what this formal languages mean:

    (RS + R)*R = R(SR + R)*
    

    From LHS, (RS + R)* uses to generate any combinations of RS and R including ^ epsilon. Some example strings are {^, RS, RSRS, RRRS, RSR,...}: strings always starts from R but can end with either S or R - we can describe in English: R can appear in any combination where S is always followed by one R (two consecutive S are not possible).

    And, complete LHS's re (RS + R)*R means string always terminates with R.

    Now, consider following examples:

    1. R + S is same as S + R, it is basically union
    2. But RS can't be written as SR, order is important in concatenation
    3. (RS)R can be written as R(SR)
    4. (RS)*R can be written as R(SR)*, both are same that is RSRSRS...SR
    5. (AB + AC) can be written as A(B + C)
    6. (AB + A) can be written as A(B + ^), this is because A = ^A = A^
    7. (BA + A) can be written as (B + ^)A.

    Formal Proof:

       (RS + R)*R      // LHS
    => (R(S + ^))*R    // from rule 6
    => R((S + ^)R)*    // from rule 4
    => R(SR + R)*      // from rule 7, in revers `(B + ^)A` --> `(BA + A)`
    // RHS 
    

    Same steps are correct for regex.