The following is the interview question:
Machine coding round: (Time 1hr)
Expression is given and a string
testCase
, need to evaluate thetestCase
is valid or not for expressionExpression may contain:
- letters
[a-z]
'.'
('.'
represents any char in[a-z]
)'*'
('*'
has same property as in normal RegExp)'^'
('^'
represents start of the String)'$'
('$'
represents end of String)Sample cases:
Expression Test Case Valid ab ab true a*b aaaaaab true a*b*c* abc true a*b*c aaabccc false ^abc*b abccccb true ^abc*b abbccccb false ^abcd$ abcd true ^abc*abc$ abcabc true ^abc.abc$ abczabc true ^ab..*abc$ abyxxxxabc true
My approach:
Convert the given regular expression into concatenation(ab
), alteration(a|b
), (a*
) kleenstar.
And add +
for concatenation.
For example:
abc$ => .*+a+b+c
^ab..*abc$ => a+b+.+.*+a+b+c
Convert into postfix notation based on precedence.
(parantheses>kleen_star>concatenation>..
)
(a|b)*+c => ab|*c+
Build NFA based on Thompson construction
Backtracking / traversing through NFA by maintaining a set of states.
When I started implementing it, it took me a lot more than 1 hour. I felt that the step 3 was very time consuming. I built the NFA by using postfix notation +stack and by adding new states and transitions as needed.
So, I was wondering if there is faster alternative solution this question? Or maybe a faster way to implement step 3. I found this CareerCup link where someone mentioned in the comment that it was from some programming contest. So If someone has solved this previously or has a better solution to this question, I'd be happy to know where I went wrong.
Some derivation of Levenshtein distance comes to mind - possibly not the fastest algorithm, but it should be quick to implement.
We can ignore ^
at the start and $
at the end - anywhere else is invalid.
Then we construct a 2D grid where each row represents a unit [1] in the expression and each column represents a character in the test string.
[1]: A "unit" here refers to a single character, with the exception that *
shall be attached to the previous character
So for a*b*c
and aaabccc
, we get something like:
a a a b c c c
a*
b*
c
Each cell can have a boolean value indicating validity.
Now, for each cell, set it to valid if either of these hold:
The value in the left neighbour is valid and the row is x*
or .*
and the column is x
(x
being any character a-z
)
This corresponds to a *
matching one additional character.
The value in the upper-left neighbour is valid and the row is x
or .
and the column is x
(x
being any character a-z
)
This corresponds to a single-character match.
The value in the top neighbour is valid and the row is x*
or .*
.
This corresponds to the *
matching nothing.
Then check if the bottom-right-most cell is valid.
So, for the above example, we get: (V
indicating valid)
a a a b c c c
a* V V V - - - -
b* - - - V - - -
c - - - - V - -
Since the bottom-right cell isn't valid, we return invalid.
Running time: O(stringLength*expressionLength)
.
You should notice that we're mostly exploring a fairly small part of the grid.
This solution can be improved by making it a recursive solution making use of memoization (and just calling the recursive solution for the bottom-right cell).
This will give us a best-case performance of O(1)
, but still a worst-case performance of O(stringLength*expressionLength)
.
My solution assumes the expression must match the entire string, as inferred from the result of the above example being invalid (as per the question).
If it can instead match a substring, we can modify this slightly so, if the cell is in the top row it's valid if:
The row is x*
or .*
.
The row is x
or .
and the column is x
.