We have two strings a
and b
respectively. The length of a
is greater than or equal to b
. We have to find out the longest common substring. If there are multiple answers then we have to output the substring which comes earlier in b
(earlier as in whose starting index comes first).
Note: The length of a
and b
can be up to 106.
I tried to find the longest common substring using suffix array (sorting the suffixes using quicksort). For the case when there is more than one answer, I tried pushing all the common substrings in a stack which are equal to the length of the longest common substring.
I wanted to know is there any faster way to do so?
Build a suffix tree of a string a$b
, that is, a
concatenated with some character like $
not occurring in both strings, then concatenated with b
. A (compressed) suffix tree can be built in O(|a|+|b|) time and memory, and have O(|a|+|b|) nodes.
Now, for each node, we know its depth (the length of the string obtained by starting from the root and traversing the tree down to that node). We also can keep track of two boolean quantities: whether this node was visited during the build phase corresponding to a
, and whether it was visited during the build phase corresponding to b
(for example, we might as well build the two trees separately and then merge them using pre-order traversal). Now, the task boils down to finding the deepest vertex which was visited during both phases, which can be done by a single pre-order traversal. The case of multiple answers should be easy to handle.
This Wikipedia page contains another (brief) overview of the technique.