Search code examples
stringalgorithmsubstringtime-complexitysuffix-tree

Is it possible to count the number of distinct substrings in a string in O(n)?


Given a string s of length n, is it possible to count the number of distinct substrings in s in O(n)?

Example

Input: abb

Output: 5 ('abb', 'ab', 'bb', 'a', 'b')

I have done some research but i can't seem to find an algorithm that solves this problem in such an efficient way. I know a O(n^2) approach is possible, but is there a more efficient algorithm?

I don't need to obtain each of the substrings, just the total number of distinct ones (in case it makes a difference).


Solution

  • You can use Ukkonen's algorithm to build a suffix tree in linear time:

    https://en.wikipedia.org/wiki/Ukkonen%27s_algorithm

    The number of substrings of s is then the number of prefixes of strings in the trie, which you can calculate simply in linear time. It's just total number of characters in all nodes.

    For instance, your example produces a suffix tree like:

                /\                
               b  a
               |  b
               b  b
    

    5 characters in the tree, so 5 substrings. Each unique string is a path from the root ending after a different letter: abb, ab, a, bb, b. So the number of strings is the number of letters in the tree.

    More precisely:

    • Every substring is the prefix of some suffix of the string;
    • All the suffixes are in the trie;
    • So there is a 1-1 correspondence between substrings and paths through the trie (by the definition of trie); and
    • There is a 1-1 correspondence between letters in the tree and non-empty paths, because:
      • each distinct non-empty path ends at a distinct position after its last letter; and
      • the path to the the position following each letter is unique

    NOTE for people who are wondering how it could be possible to build a tree that contains O(N^2) characters in O(N) time:

    There's a trick to the representation of a suffix tree. Instead of storing the actual strings in the nodes of the tree, you just store pointers into the orignal string, so the node that contains "abb" doesn't have "abb", it has (0,3) -- 2 integers per node, regardless of how long the string in each node is, and the suffix tree has O(N) nodes.