Search code examples
javaalgorithmdata-structuresknuth-morris-pratt

Find the list of starting indexes of occurring of a string in another string in O(N) time complexity


Given are two strings : str1 and str2. Find all the starting indexes of str1 in str2. Example: I/p = str1: abc, str2: abckdabcgfacabc O/p = [0, 5, 12]

public static List<Integer> firstMatchingIndexes(String str1, String str2) {
    List<Integer> indexes = new ArrayList<>();
    int end = 0, n = str2.length();
    
    for(; end<n; end++) {
        int index = str2.substring(end, n).indexOf(str1)+end;
        if(index!=-1)
            indexes.add(index);
        else
            break;
        end =index + str1.length()-1;
        
    }
    
    return indexes;
}

But this approach uses indexOf() which internally has O(N) time complexity. Can KMP algorithm work here?


Solution

  • Yes, the Knuth–Morris–Pratt algorithm can be of use here.

    The Wikipedia article on the Knuth–Morris–Pratt algorithm provides pseudocode for the algorithm. Here is that pseudocode ported to Java:

        static int[] kmpTable(String pattern) {
            int n = pattern.length();
            int[] partialMatchTable = new int[n+1];
            int j = 0;
    
            partialMatchTable[0] = -1;
    
            for (int i = 1; i < n; i++, j++) {
                if (pattern.charAt(i) == pattern.charAt(j)) {
                    partialMatchTable[i] = partialMatchTable[j];
                } else {
                    partialMatchTable[i] = j;
                    while (j >= 0 && pattern.charAt(i) != pattern.charAt(j)) {
                        j = partialMatchTable[j];
                    }
                }
            }
            partialMatchTable[n] = j;
            return partialMatchTable;
        }
     
        static List<Integer> kmpSearch(String needle, String haystack) {
            List<Integer> matches = new ArrayList<>();
            int m = haystack.length();
            int n = needle.length();
            if (n > m) { // Added this to avoid O(m) runtime
                return matches; // Return empty list when needle is too large
            }
            int[] partialMatchTable = kmpTable(needle);
            int j = 0, k = 0;
    
            while (j < m) {
                if (needle.charAt(k) == haystack.charAt(j)) {
                    j++;
                    k++;
                    if (k == n) {
                        matches.add(j - k);
                        k = partialMatchTable[k];
                    }
                } else {
                    k = partialMatchTable[k];
                    if (k < 0) {
                        j++;
                        k++;
                    }
                }
            }
            return matches;
        }
    
        public static void main(String args[])
        {
            System.out.println("matches: " + kmpSearch("abc", "abckdabcgfacabc"));
        }
    

    This outputs:

    matches: [0, 5, 12]
    

    Wikipedia says about the time complexity:

    the Knuth–Morris–Pratt algorithm has complexity O(𝑛), where 𝑛 is the length of 𝑆.

    And when taking into account the time needed for building the partial-match table for a pattern of length 𝑘:

    Since the two portions of the algorithm have, respectively, complexities of O(𝑘) and O(𝑛), the complexity of the overall algorithm is O(𝑛 + 𝑘).

    We can however say it is O(𝑛) when 𝑘 ≤ 𝑛. When searching with a pattern that is longer than the string to search in, the algorithm could just skip building the partial match table for such a search string and exit with an empty list (See commented code that is not present in Wikipedia's pseudocode). That way it is always O(𝑛).