Search code examples
javaalgorithmsubstring

Find repeating substring of Length N


I have to make a Java program which finds all repeating sub-strings of length n in a given String. The input is string is extremely long and a brute-force approach takes too much time.

I alread tried:
Presently I am finding each sub-string separately and checking for repetitions of that sub-string using the KMP alogrithm. This too is taking too much time.

What is a more efficient approach for this problem?


Solution

  • I am going to take @peter.petrov's suggestion and enhance it by explaining how can one actually use a suffix tree to solve the problem:

     1. Create a suffix tree from the string, let it be `T`.
     2. Find all nodes of depth `n` in the tree, let that set of nodes be `S`. This can be done using DFS, for example.
     3. For each node `n` in `S`, do the following:
         3.1. Do a DFS, and count the number of terminals `n` leads to. Let this number be `count`
         3.2. If `count>1`, yield the substring that is related to `n` (the path from root to `n`), and `count`
    

    Note that this algorithm treats any substring of length n and add it to the set S, and from there it search for how many times this was actually a substring by counting the number of terminals this substring leads to.

    This means that the complexity of the problem is O(Creation + Traversal) - meaning, you first create the tree and then you traverse it (easy to see you don't traverse in steps 2-3 each node in the tree more than once). Since the traversal is obviously "faster" than creation of the tree - it leaves you with O(Creation), which as was pointed by @perer.petrov is O(|S|) or O(|S|log|S|) depending on your algorithm of choice.