Search code examples
arraysalgorithmprefixsuffixknuth-morris-pratt

Why/how does the Longest Proper Prefix/Suffix algorithm work?


The LPS (Longest Proper Prefix which is also a Suffix) algorithm goes as follows:

public static int[] constructLPSArray(String s) {
        int n = s.length();
        int[] arr = new int[n];
        int j = 0;
        for (int i = 1; i < n; ) {
            if (s.charAt(i) == s.charAt(j)) {
                arr[i] = j + 1;
                i++;
                j++;
            } else {
                if (j != 0) {
                    j = arr[j - 1];
                } else {
                    i++;
                }
            }
        }
        return arr;
    }

The if (s.charAt(i) == s.charAt(j)) part looks clear, but the else part is unclear. Why do we do:

if (j != 0) {
  j = arr[j - 1];
} else {
  i++;
}

More specifically, why does j = arr[j - 1] work ? Or why do we even do it? How do we validate the correctness of this step?


Solution

  • Let's say we are parsing an array of characters with i and j positioned like this:

    a b a b x x a b a b ...
          ^           ^
          j           i
    

    with arr holding:

    0 0 1 2 0 0 1 2 3 4
    

    i. e., the length of the longest prefix/suffix of each substring of s of that length until i. You can probably guess how that was generated from the rest of the algorithm. Now, if the next character after i does not match the next character after j,

    a b a b x x a b a b a ...
            ^           ^
            j           i
    

    we don't have to retry the matching, because we know the longest prefix/suffix of our previous prefix/suffix! Looking up arr[j - 1] yields 2 – so we essentially cached the information that the parts highlighted here

    A B a b x x a b A B a ...
    === ^           === ^
        j               i
    

    are identical, and don't need to be compared again!