Search code examples
algorithmlistsuffix-array

How to Modify a Suffix Array to search multiple strings?


I've recently been updating my knowledge of algorithms and have been reading up on suffix arrays. Every text I've read has defined them as an array of suffixes over a single search string, but some articles have mentioned its 'trivial' to generalize to an entire list of search strings, but I can't see how.

Assume I'm trying to implement a simple substring search over a word list and wish to return a list of words matching a given substring. The naive approach would appear to be to insert the lexicographic end character '$' between words in my list, concatenate them all together, and produce a suffix tree from the result. But this would seem to generate large numbers of irrelevant entries. If I create a source string of 'banana$muffin' then I'll end up generating suffixes for 'ana$muffin' which I'll never use.

I'd appreciate any hints as to how to do this right, or better yet, a pointer to some algorithm texts that handle this case.


Solution

  • After having now read through most of the book Algorithms on Strings, Trees and Sequences by Dan Gusfield, the answer seems clear.

    If you start with a multi-string suffix tree, one of the standard conversion algorithms will still work. However, instead of having getting an array of integers, you end up with an array of lists. Each lists contains one or more pairs of a string identifier and a starting offset in that string.

    The resulting structure is still useful, but not as efficient as a normal suffix array.