java algorithm data-structures knuth-morris-pratt

Find the list of starting indexes of occurring of a string in another string in O(N) time complexity

Given are two strings : str1 and str2. Find all the starting indexes of str1 in str2. Example: I/p = str1: abc, str2: abckdabcgfacabc O/p = [0, 5, 12]

public static List<Integer> firstMatchingIndexes(String str1, String str2) {
    List<Integer> indexes = new ArrayList<>();
    int end = 0, n = str2.length();
    
    for(; end<n; end++) {
        int index = str2.substring(end, n).indexOf(str1)+end;
        if(index!=-1)
            indexes.add(index);
        else
            break;
        end =index + str1.length()-1;
        
    }
    
    return indexes;
}

But this approach uses indexOf() which internally has O(N) time complexity. Can KMP algorithm work here?

Solution

Yes, the Knuth–Morris–Pratt algorithm can be of use here.

The Wikipedia article on the Knuth–Morris–Pratt algorithm provides pseudocode for the algorithm. Here is that pseudocode ported to Java:

    static int[] kmpTable(String pattern) {
        int n = pattern.length();
        int[] partialMatchTable = new int[n+1];
        int j = 0;

        partialMatchTable[0] = -1;

        for (int i = 1; i < n; i++, j++) {
            if (pattern.charAt(i) == pattern.charAt(j)) {
                partialMatchTable[i] = partialMatchTable[j];
            } else {
                partialMatchTable[i] = j;
                while (j >= 0 && pattern.charAt(i) != pattern.charAt(j)) {
                    j = partialMatchTable[j];
                }
            }
        }
        partialMatchTable[n] = j;
        return partialMatchTable;
    }
 
    static List<Integer> kmpSearch(String needle, String haystack) {
        List<Integer> matches = new ArrayList<>();
        int m = haystack.length();
        int n = needle.length();
        if (n > m) { // Added this to avoid O(m) runtime
            return matches; // Return empty list when needle is too large
        }
        int[] partialMatchTable = kmpTable(needle);
        int j = 0, k = 0;

        while (j < m) {
            if (needle.charAt(k) == haystack.charAt(j)) {
                j++;
                k++;
                if (k == n) {
                    matches.add(j - k);
                    k = partialMatchTable[k];
                }
            } else {
                k = partialMatchTable[k];
                if (k < 0) {
                    j++;
                    k++;
                }
            }
        }
        return matches;
    }

    public static void main(String args[])
    {
        System.out.println("matches: " + kmpSearch("abc", "abckdabcgfacabc"));
    }

This outputs:

matches: [0, 5, 12]

Wikipedia says about the time complexity:

the Knuth–Morris–Pratt algorithm has complexity O(𝑛), where 𝑛 is the length of 𝑆.

And when taking into account the time needed for building the partial-match table for a pattern of length 𝑘:

Since the two portions of the algorithm have, respectively, complexities of O(𝑘) and O(𝑛), the complexity of the overall algorithm is O(𝑛 + 𝑘).

We can however say it is O(𝑛) when 𝑘 ≤ 𝑛. When searching with a pattern that is longer than the string to search in, the algorithm could just skip building the partial match table for such a search string and exit with an empty list (See commented code that is not present in Wikipedia's pseudocode). That way it is always O(𝑛).