Search code examples
javadynamic-programmingsubsequence

How to modify a longest common subsequence algorithm to a longest common contiguous subsequence algorithm?


I'm wondering how to change the class I have so that it can produce the longest contiguous common subsequence between two strings. This code right now is setup so that I can get the length of the longest common subsequence of the two strings. I want the final result string to be unbroken (it can be found in both strings with no characters in between). I would like it if, for example, sending the method ACGGTTGTCGCAGTCC and TGTAGCAG would result in length 4 for GCAG, instead of what it does now, which is length 7 for TGTGCAG.

public class GenomeSequencer {
    private String x;
    private String y;
public String getLongestCommon() {
        int[][] table = lcsLength();
        return Integer.toString(table[x.length()][y.length()]);
    }

    private int[][] lcsLength() {
        int m = x.length();
        int n = y.length();

        int[][] c = new int[m + 1][n + 1];

        for (int i = 0; i <= m; i++) {
            c[i][0] = 0;
        }
        for (int j = 0; j <= n; j++) {
            c[0][j] = 0;
        }
        for (int i = 1; i <= m; i++) {
            for (int j = 1; j <= n; j++) {
                if (x.charAt(i - 1) == y.charAt(j - 1)) {
                    c[i][j] = c[i - 1][j - 1] + 1;
                } else if (c[i - 1][j] >= c[i][j - 1]) {
                    c[i][j] = c[i - 1][j];
                } else {
                    c[i][j] = c[i][j - 1];
                }
            }
        }
        return c;

    }

}

Solution

  • If you really want to reuse the old solution, consider why it doesn't work anymore.

    Of the three transitions, the one when letters are equal should survive, and the two skipping a letter in one of the strings should disappear.

    So, the transitions can be just:

                    if (x.charAt(i - 1) == y.charAt(j - 1)) {
                        c[i][j] = c[i - 1][j - 1] + 1;
                    } else {
                        c[i][j] = 0;
                    }
    

    Note that, with this approach, the answer is no longer at c[m][n]: instead, you should search the whole table for the maximum value.


    The simplicity of the transitions suggests there are faster solutions to the longest common substring problem (substring is a consecutive subsequence). And indeed there is a linear yet complex solution involving a suffix tree (mentioned in the above link), or simpler solutions using hashes and binary search for the answer's length.