Interview: Machine coding / regex (Better alternative to my solution)

The following is the interview question:

Machine coding round: (Time 1hr)

Expression is given and a string testCase, need to evaluate the testCase is valid or not for expression

Expression may contain:

letters [a-z]

'.' ('.' represents any char in [a-z])

'*' ('*' has same property as in normal RegExp)

'^' ('^' represents start of the String)

'$' ('$' represents end of String)

Sample cases:
Expression   Test Case   Valid
ab           ab          true 
a*b          aaaaaab     true 
a*b*c*       abc         true 
a*b*c        aaabccc     false 
^abc*b       abccccb     true 
^abc*b       abbccccb    false 
^abcd$       abcd        true 
^abc*abc$    abcabc      true 
^abc.abc$    abczabc     true 
^ab..*abc$   abyxxxxabc  true

My approach:

Convert the given regular expression into concatenation(ab), alteration(a|b), (a*) kleenstar.
And add + for concatenation.
For example:
```
abc$  =>  .*+a+b+c
^ab..*abc$  => a+b+.+.*+a+b+c
```
Convert into postfix notation based on precedence.
(parantheses>kleen_star>concatenation>..)
```
(a|b)*+c  =>  ab|*c+
```
Build NFA based on Thompson construction
Backtracking / traversing through NFA by maintaining a set of states.

When I started implementing it, it took me a lot more than 1 hour. I felt that the step 3 was very time consuming. I built the NFA by using postfix notation +stack and by adding new states and transitions as needed.

So, I was wondering if there is faster alternative solution this question? Or maybe a faster way to implement step 3. I found this CareerCup link where someone mentioned in the comment that it was from some programming contest. So If someone has solved this previously or has a better solution to this question, I'd be happy to know where I went wrong.

Solution

Some derivation of Levenshtein distance comes to mind - possibly not the fastest algorithm, but it should be quick to implement.

We can ignore ^ at the start and $ at the end - anywhere else is invalid.

Then we construct a 2D grid where each row represents a unit ^[1] in the expression and each column represents a character in the test string.

[1]: A "unit" here refers to a single character, with the exception that * shall be attached to the previous character

So for a*b*c and aaabccc, we get something like:

   a a a b c c c
a*
b*
c

Each cell can have a boolean value indicating validity.

Now, for each cell, set it to valid if either of these hold:

The value in the left neighbour is valid and the row is x* or .* and the column is x (x being any character a-z)

This corresponds to a * matching one additional character.
The value in the upper-left neighbour is valid and the row is x or . and the column is x (x being any character a-z)

This corresponds to a single-character match.
The value in the top neighbour is valid and the row is x* or .*.

This corresponds to the * matching nothing.

Then check if the bottom-right-most cell is valid.

So, for the above example, we get: (V indicating valid)

   a a a b c c c
a* V V V - - - -
b* - - - V - - -
c  - - - - V - -

Since the bottom-right cell isn't valid, we return invalid.

Running time: O(stringLength*expressionLength).

You should notice that we're mostly exploring a fairly small part of the grid.

This solution can be improved by making it a recursive solution making use of memoization (and just calling the recursive solution for the bottom-right cell).

This will give us a best-case performance of O(1), but still a worst-case performance of O(stringLength*expressionLength).

My solution assumes the expression must match the entire string, as inferred from the result of the above example being invalid (as per the question).

If it can instead match a substring, we can modify this slightly so, if the cell is in the top row it's valid if:

The row is x* or .*.
The row is x or . and the column is x.