Search code examples
algorithmlongest-substring

Is this method of Longest common substring correct?


I have found an algorithm for Longest Common Substring. It is usually done using dynamic programming, using a 2-D array of size mxn where m and n are lengths of the two strings under consideration.

I will construct the following matrix for the two strings.

M[i][j] = 1 if s1[i]==s2[j] else 0.

For example, if the strings are: abcxy and pqaabx

The matrix looks as follows:

    a b c x y
 p  0 0 0 0 0
 q  0 0 0 0 0
 a  1 0 0 0 0
 a  1 0 0 0 0
 b  0 1 0 0 0
 x  0 0 0 1 0

Now, I search for a maximal continuous sequence of 1s in every diagonal which is in top-left to bottom-right direction.

The maximum value among these will be the answer.

I can perform the above operation without using the array explicitly. The time-complexity is still O(M*N). So, there is no need of memory.

Can anyone point me where I am going wrong?


Solution

  • Your method is correct. For proof suppose the longest common substring for S1 and S2 was from S1[i..j] and S2[p..q]. this implies S1[i+k] = S2[p+k]

    These all lie on the diagonal starting from (i,p).

    The dynamic programming solution does the same thing but instead of computing the table first and going through diagonal paths it computes the table depending on it's diagonal parent plus whether or not they match.

    EDITED

    On your comment on the wikipedia solution using additional memory. It's there only for clarity. In principle you need only two rows of the matrix in the wikipedia solution and keep the current maximum count in one variable. This is correct since for any (i,j)th entry in the matrix

    M(i,j) = 1 + M(i-1, j-1) (if s1[i] == s2[j])

    as you can see the current row elements depend only on the elements of the immediately upper row.