Search code examples
algorithmsuffix-tree

Can we use circular strings with Suffix Trees?


Can we use circular strings with Suffix Trees? So the last character is followed by the first in the list.

If so, how is the representation of this suffix tree different from a normal suffix tree?


Solution

  • It depends on what you mean by "use".

    1) Firstly, interpreting your question in the most direct possible way, consider a circular string of length n, i.e. an infinite string that repeats itself every n characters. Such an object has no suffixes in the usual sense of the word because it never ends, so you can't construct a suffix tree of it.

    2) However, the idea certainly is that we have a finite representation of the circular string, which uses a link from the last character to the first. In a similar way, we can extend a given suffix tree through the use of links to a circular suffix tree that represents all the (infinitely long) suffixes of the circular string. Note that this cannot be done by inserting a link from each leaf to the root of the node, because from the root, there are outgoing edges for all suffixes of the string, but from a leaf of such a circular string, there can be only one outgoing edge. Example: The leaf that represents the suffix "ssippi$" of "mississippi$" should have an outgoing edge carrying an infinite label "mississippi$mississippi$mississippi$...." and no other edges. If you were to link it to the root of the tree, there would be many more incorrect continuations.

    So two things are necessary:

    • Outgoing edges from leaves (which is a funny concept after all). Each leaf gets one outgoing edge.
    • Edges carrying infinite labels. This labels could be represented as circular strings (and that circular string would be the same for all outgoing edges of leaves).

    This would give you a valid representation of all (infinite) suffixes of the circular string.

    3) I am not really sure if that representation would be useful for anything. If the purpose of constructing the suffix tree is to enable substring searches, then the usual trick of concatenating the finite representation of the circular string (not including the link) to itself and constructing a suffix tree from this should suffice unless the substrings you search for are themselves longer than n characters.

    It is also important to note that certain other uses of a suffix tree would require the introduction of further "infinite" concepts. For example, for certain applications it may be required to store the character depth of a tree node (i.e. the combined length of the edge labels leading from root to a particular node) in that node. In the "circular suffix tree" proposed above, the outgoing edges of leaves would lead to some sort of special "leaf in the limit" and carry a circular string as label. Any query that is matched into such a circular string would require a special way to keep track of the matching depth as there are no inner nodes on that edge to store the depth information.

    4) Having said all of this, there is actually at least one known application of suffix trees to circular strings, but not in the sense of (1), (2) or (3) above, i.e. representing the entire infinite object through the use of a suffix tree. Rather, a suffix tree of a finite substring of the circular string is used to solve the problem of lexicographically minimal rotation. The problem is described on Wikipedia, although the solutions listed there do not include any that makes use of suffix trees. However, Dan Gusfield describes the solution in Algorithms on Strings, Trees and Sequences, in section 7.13.

    The idea is that you consider the set of lexicographically minimal rotations of a string S of length n as equivalent to the set of the first length-n substrings of a circular string. The problem is then equivalent to that of finding a lexicographically minimal cut-off point. Gusfield solves it by constructing a suffix tree of the string SS$, traverses this tree by taking the lexicographically smallest edge at each node and thus ending up at a node that corresponds to the lexicographically smallest cut-off point.

    So, as (4) demonstrates, there are certain practical "uses" of suffix trees in the context of circular strings, but I am unsure if that is the kind of use you are interested in.