Search code examples
stringalgorithmsimilaritysuffix-array

Looking for ideas: lexicographically sorted suffix array of many different strings compute efficiently an LCP array


I don't want a direct solution to the problem that's the source of this question but it's this one link:

So I take in the strings and add them to a suffix array which is implemented as a sorted set internally, what I obtain then is a lexicographically sorted list of the two given strings.

S1 = "banana"
S2 = "panama"

SuffixArray.add S1, S2

To make searching for the k-th smallest substring efficient I preprocess this sorted set to add in information about the longest common prefix between a suffix and it's predecessor as well as keeping tabs on a cumulative substrings count. So I know that for a given k greater than the cumulative substrings count of the last item, it's an invalid query.

This works really well for small inputs as well as random large inputs of the constraints given in the problem definition, which is at most 50 strings of length 2000. I am able to pass the 4 out of 7 cases and was pretty surprised I didn't get them all.

So I went searching for the bottleneck and it hit me. Given large number of inputs like these

anananananananana.....ananana
bkbkbkbkbkbkbkbkb.....bkbkbkb

The queries for k-th smallest substrings are still fast as expected but not the way I preprocess the sorted set... The way I calculate the longest common prefix between the elements of the set is not efficient and linear O(m), like this, I did the most naïve thing expecting it to be good enough:

m = anananan
n = anananana

Start at 0 and find the point where `m[i] != n[i]`

It is like this because a suffix and his predecessor might no be related (i.e. coming from different input strings) and so I thought I couldn't help but using brute force.

Here is the question then and where I ended up reducing the problem as. Given a list of lexicographically sorted suffix like in the manner I described above (made up of multiple strings):

What is an efficient way of computing the longest common prefix array?.

The subquestion would then be, am I completely off the mark in my approach? Please propose further avenues of investigation if that's the case.

Foot note, I do not want to be shown implemented algorithm and I don't mind to be told to go read so and so book or resource on the subject as that is what I do anyway while attempting these challenges.

Accepted answer will be something that guides me on the right path or in the case that that fails; something that teaches me how to solve these types of problem in a broader sense, a book or something


Solution

  • READING

    I would recommend this tutorial pdf from Stanford.

    This tutorial explains a simple O(nlog^2n) algorithm with O(nlogn) space to compute suffix array and a matrix of intermediate results. The matrix of intermediate results can be used to compute the longest common prefix between two suffixes in O(logn).

    HINTS

    If you wish to try to develop the algorithm yourself, the key is to sort the strings based on their 2^k long prefixes.

    From the tutorial:

    Let's denote by A(i,k) be the subsequence of A of length 2^k starting at position i. The position of A(i,k) in the sorted array of A(j,k) subsequences (j=1,n) is kept in P(k,i).

    and

    Using matrix P, one can iterate descending from the biggest k down to 0 and check whether A(i,k) = A(j,k). If the two prefixes are equal, a common prefix of length 2^k had been found. We only have left to update i and j, increasing them both by 2^k and check again if there are any more common prefixes.