Search code examples
regexalgorithmfinite-automata

Interview: Machine coding / regex (Better alternative to my solution)


The following is the interview question:

Machine coding round: (Time 1hr)

Expression is given and a string testCase, need to evaluate the testCase is valid or not for expression

Expression may contain:

  • letters [a-z]
  • '.' ('.' represents any char in [a-z])
  • '*' ('*' has same property as in normal RegExp)
  • '^' ('^' represents start of the String)
  • '$' ('$' represents end of String)

Sample cases:

Expression   Test Case   Valid
ab           ab          true 
a*b          aaaaaab     true 
a*b*c*       abc         true 
a*b*c        aaabccc     false 
^abc*b       abccccb     true 
^abc*b       abbccccb    false 
^abcd$       abcd        true 
^abc*abc$    abcabc      true 
^abc.abc$    abczabc     true 
^ab..*abc$   abyxxxxabc  true

My approach:

  1. Convert the given regular expression into concatenation(ab), alteration(a|b), (a*) kleenstar.
    And add + for concatenation.
    For example:

    abc$  =>  .*+a+b+c
    ^ab..*abc$  => a+b+.+.*+a+b+c
    
  2. Convert into postfix notation based on precedence.
    (parantheses>kleen_star>concatenation>..)

    (a|b)*+c  =>  ab|*c+
    
  3. Build NFA based on Thompson construction

  4. Backtracking / traversing through NFA by maintaining a set of states.

When I started implementing it, it took me a lot more than 1 hour. I felt that the step 3 was very time consuming. I built the NFA by using postfix notation +stack and by adding new states and transitions as needed.

So, I was wondering if there is faster alternative solution this question? Or maybe a faster way to implement step 3. I found this CareerCup link where someone mentioned in the comment that it was from some programming contest. So If someone has solved this previously or has a better solution to this question, I'd be happy to know where I went wrong.


Solution

  • Some derivation of Levenshtein distance comes to mind - possibly not the fastest algorithm, but it should be quick to implement.

    We can ignore ^ at the start and $ at the end - anywhere else is invalid.

    Then we construct a 2D grid where each row represents a unit [1] in the expression and each column represents a character in the test string.

    [1]: A "unit" here refers to a single character, with the exception that * shall be attached to the previous character

    So for a*b*c and aaabccc, we get something like:

       a a a b c c c
    a*
    b*
    c
    

    Each cell can have a boolean value indicating validity.

    Now, for each cell, set it to valid if either of these hold:

    • The value in the left neighbour is valid and the row is x* or .* and the column is x (x being any character a-z)

      This corresponds to a * matching one additional character.

    • The value in the upper-left neighbour is valid and the row is x or . and the column is x (x being any character a-z)

      This corresponds to a single-character match.

    • The value in the top neighbour is valid and the row is x* or .*.

      This corresponds to the * matching nothing.

    Then check if the bottom-right-most cell is valid.

    So, for the above example, we get: (V indicating valid)

       a a a b c c c
    a* V V V - - - -
    b* - - - V - - -
    c  - - - - V - -
    

    Since the bottom-right cell isn't valid, we return invalid.

    Running time: O(stringLength*expressionLength).


    You should notice that we're mostly exploring a fairly small part of the grid.

    This solution can be improved by making it a recursive solution making use of memoization (and just calling the recursive solution for the bottom-right cell).

    This will give us a best-case performance of O(1), but still a worst-case performance of O(stringLength*expressionLength).


    My solution assumes the expression must match the entire string, as inferred from the result of the above example being invalid (as per the question).

    If it can instead match a substring, we can modify this slightly so, if the cell is in the top row it's valid if:

    • The row is x* or .*.

    • The row is x or . and the column is x.