I have to make a Java program which finds all repeating sub-strings of length n in a given String. The input is string is extremely long and a brute-force approach takes too much time.
I alread tried:
Presently I am finding each sub-string separately and checking for repetitions of that sub-string using the KMP alogrithm. This too is taking too much time.
What is a more efficient approach for this problem?
I am going to take @peter.petrov's suggestion and enhance it by explaining how can one actually use a suffix tree to solve the problem:
1. Create a suffix tree from the string, let it be `T`.
2. Find all nodes of depth `n` in the tree, let that set of nodes be `S`. This can be done using DFS, for example.
3. For each node `n` in `S`, do the following:
3.1. Do a DFS, and count the number of terminals `n` leads to. Let this number be `count`
3.2. If `count>1`, yield the substring that is related to `n` (the path from root to `n`), and `count`
Note that this algorithm treats any substring of length n
and add it to the set S
, and from there it search for how many times this was actually a substring by counting the number of terminals this substring leads to.
This means that the complexity of the problem is O(Creation + Traversal)
- meaning, you first create the tree and then you traverse it (easy to see you don't traverse in steps 2-3 each node in the tree more than once). Since the traversal is obviously "faster" than creation of the tree - it leaves you with O(Creation)
, which as was pointed by @perer.petrov is O(|S|)
or O(|S|log|S|)
depending on your algorithm of choice.