Search code examples
algorithmdata-structuressuffix-tree

Generating suffixes from a Suffix Tree


I've built a suffix tree in Java based on the site here http://marknelson.us/1996/08/01/suffix-trees/ but I've run into a problem. I can build a suffix tree fine but I can trying to build a set of all suffixes from the tree. I basically find all the 'end nodes' and return the string which is represented by that 'end node'

That algorithm works for a word like "bookkeeper"

├── (1) bookkeeper
├── (9) e
│   ├── (8) eper
│   ├── (10) per
│   └── (12) r
├── (6) k
│   ├── (7) eeper
│   └── (5) keeper
├── (3) o
│   ├── (4) kkeeper
│   └── (2) okkeeper
├── (11) per
└── (13) r

Suffixes:

[bookkeeper, ookkeeper, okkeeper, kkeeper, keeper, eeper, eper, er, per, r]

But it fails when I use something like "ATATATATATA"

├── (1) ATATATATATA
└── (2) TATATATATA

Suffixes:

[ATATATATATA, TATATATATA]

But the proper suffixes should be:

[A, ATA, ATATA, ATATATA, ATATATATA, ATATATATATA, TA, TATA, TATATA, TATATATA, TATATATATA]

I can find the answer by finding all suffixes of each 'end node' string but that doesn't seem like that right approach. Any other suggestions?

EDIT: Thanks izomorphius! Adding an END_CHAR to the original string helped a bunch.

├── (21) #
├── (19) A
│   ├── (20) #
│   └── (15) TA
│       ├── (16) #
│       └── (11) TA
│           ├── (12) #
│           └── (7) TA
│               ├── (8) #
│               └── (3) TA
│                   ├── (4) #
│                   └── (1) TA#
└── (17) TA
    ├── (18) #
    └── (13) TA
        ├── (14) #
        └── (9) TA
            ├── (10) #
            └── (5) TA
                ├── (6) #
                └── (2) TA#

Suffixes:

[A, ATA, ATATA, ATATATA, ATATATATA, ATATATATATA, TA, TATA, TATATA, TATATATA, TATATATATA]

Solution

  • The typical tip on how to build a suffix tree is to add one more artificial character that you know is not in the alphabet. I usually add '#' and then I build the suffix tree for ATATATATATA# that way you will not have this problem any more.

    You get the problem you describe because the missing suffixes are actually met as prefixes of another suffix. Adding an artificial character at the end ensures this will never happen.