Search code examples
algorithmdata-structuressuffix-tree

String matching with an implicit representation of a suffix tree


From Data Structures and Algorithm Analysis in Java, Weiss:

explicit and implicit representation of the suffix tree for 'banana'

Weiss writes:

  1. In the leaves, we use the index where the suffix begins (as in the suffix array)
  2. In the internal nodes, we store the number of common characters matched from the root until the internal node; this number represents the letter depth.

My question: given the input string (e.g. 'banana') and the implicit representation of the suffix tree, what would a good algorithm for substring search look like? The algorithms I've seen assume a different representation of the tree. I'd like to do substring search without converting to a different tree representation.


Solution

  • I've never seen this representation before. It's more common to represent the labels on the edges as pairs of integers delineating a range of characters from the original string, which would let you more easily determine what the characters on the edges are (you can just look back at the original string at those characters on an as-needed basis to see whether they match the substring you're looking at).

    I'm fairly certain that this compressed representation is not good at matching substrings. In order to follow an edge, you need to know what characters are on that edge, but you can't tell what those characters are unless you scan over the characters of the original string to find something that could conceivably match. You could consider descending down into the subtree to find a suffix there and use that to reconstruct the characters, but this requires extra time and breaks the time bounds you'd expect of a suffix tree.

    My best guess here is that the author is mistaken about how to represent a suffix tree in a small amount of space.