Search code examples
stringalgorithmtriesuffix-treelongest-substring

Why we don't use prefix tree (trie) to find longest common substring?


Recently I am learning how to use tree to solve the longest common substring problem. After learning from Wiki and other online resources, I found that we should use suffix tree to find longest common substring.

As the wiki said:

The longest common substrings of a set of strings can be found by building a generalised suffix tree for the strings, and then finding the deepest internal nodes which have leaf nodes from all the strings in the subtree below it

As Justin said:

String = ABCDE$XABCZ$
    End of word character 1 = $
    └── (0)
        ├── (20) $
        ├── (22) ABC
        │   ├── (15) DE$
        │   └── (23) Z$
        ├── (24) BC
        │   ├── (16) DE$
        │   └── (25) Z$
        ├── (26) C
        │   ├── (17) DE$
        │   └── (27) Z$
        ├── (18) DE$
        ├── (19) E$
        ├── (21) XABCZ$
        └── (28) Z$

In a (compact) suffix tree, you need to find the deepest internal node(s) which have leaf nodes from all the strings. If you have multiple nodes at the same depth, you have to compare the length of the string represented by that node. i.e. ABC, BC, and C all have the same depth, so you have to compare the length of ABC, BC, and C strings to see which is longer; which is ABC obviously.

Here I thought the process of finding deepest internal nodes which have leaf nodes from all strings is actually the process of finding longest common prefix of all suffixes from all strings.

So here's a question: Why don't we build prefix tree which store all suffixes from all strings? Then we can search prefix tree to find longest common prefix of these suffixes. I cannot tell the difference between these two. Could anyone give me some clues why we use suffix tree instead of prefix tree to solve this problem?


Solution

  • Suffix tree requires only O(N) time and space for a string with length N. That's why it is possible to solve the longest common substring problem in linear time using it.
    Adding all suffices of a string to a trie requires O(N^2) time and space in the worst case.

    So you idea of adding all suffices of all strings to a trie is actually correct, but is inefficient compared to a solution with a suffix tree.