If I have a generalized suffix tree and I want to insert a new string of length m
, is it possible in O(m)
?
(and the total lengths in the tree are M >> m
)
Yes it is. If you are not familiar with it yet, I suggest you read about Ukkonen's algorithm.
Now suppose you have a Generalized Suffix Tree (denoted GST from now on) containing N strings. Then, you want to insert an N+1-th string S. I will assume that you have defined a terminal character for your strings (eg. $
) and that this character will never be met in your input strings.
$
to S (i.e. from now on we consider S to be $
terminated)This first step will lead you to the starting point from where you need to work to edit the tree. It is obviously linear: O(k) when the mismatch occured at the k-th character of S.
Let us assume that S was not inserted before. The mismatch can occur in one of the following fashions:
This must remind you the iterative construction of a suffix tree and this is exactly it: we just need to point out the starting point and resume the tree construction (which will handle this dichotomy).
So you just have to resume the construction of the tree for S as if the GST was a common Suffix Tree(1). This construction is O(m-k), so total complexity is O(m-k + k) = O(m)
(1): Actually, it's a bit more complicated than that. In Ukkonen's algorithm, a substring of the string S represented by the Suffix Tree is basically a pair of indices (start, end) which denotes the first and the last character of the said substring. You need some additional machinery to keep Ukkonen's trick going when you consider a GST because you have more than one string to refer to. In other words, substrings needs to be attached to their reference string. More on that topic on this other SO question.