Search code examples
c++algorithmsuffix-tree

how to find the indexes of all matching substring using suffix tree?


I created a suffix tree from this amazing answer. It works like a charm!

For now, if I look for "cat" in "This cat is a pretty cat", it will return 5 as "cat" first appearance as for starting index 5.

But I can't find a way to keep track of all the suffixes in the algorithm to create. So basically, I can find the index of the first match, but not all the different occurrences.

For now, I have:

class Edge
{
    int textIndexFrom;
    Node* nodefrom;
    Node* nodeTo;
    int textIndexTo;
}

class Node
{
    std::map<char,Edge*> m_childEdges;
    Edge* m_pParentEdge;
    Node* m_pLinkedNode;
}

I just put the relevant variables in the code above. To store the different starting positions, I imagine a std::vector is needed in Edge, but I don't see when to add a new index. We might use the active point but with the suffix links, it becomes tricky.

Could someone explain?


Solution

  • I assume you constructed a suffix tree for the string S$ where $ is some special character not present in S. The $ char ensures that each suffix has its own leaf in the tree. The number of occurances of word w in S is the number of leaves in the subtree of w.

    I think that storing all starting positions in each edge/node would require quadratic memory. For example if T happens to be perfectly balanced binary tree then on level 0 (root) you have n entries, on level 1 you have 2 * n/2 entries and so on. After summing it gives us n^2. It requires proving so please correct me if I'm wrong.

    Instead I think its better to keep all the leaves in a list in order they appear in dfs traversal of the tree (left to right if you draw a picture of the tree). And in every node v keep 2 pointers to the elements of that list. First should point to the first leaf of v's subtree and second to the last leaf of v's subtree. All that can be computed by simple dfs procedure. Now if for example 'cat' happens to be in the middle of some edge then go through that edge to some node v and get leaves from that node. In addition in every leaf you should store the length of the path from root to that leaf. It will help you find the position of that particular 'cat' occurance.