Search code examples
c++stringalgorithmtime-complexitysuffix-array

How to find ith substring of a string using suffix array and LCP array?


If we arrange all the distinct sub-strings of a string lexicographically and we need the i'th substring

1.) Is it possible to find it using it's suffix array and LCP array?

2.) If Yes, how do we do it ? can it be done in O(Nlog^N) while creating the suffix array using Manber & Myers which has a time complexity of O(Nlog^2N) or while creating it's LCP array using kasai's algorithm which has a time complexity of O(N)?


Solution

  • Yes it can be done using Suffix array and LCP array.

    Assuming you know how to calculate Suffix array and LCP array.

    Let p[] denote suffix array lcp[] denote LCP array.

    create a array which store the number of distinct sub string till i'th rank suffix. This can be calculated using this formula. For more details see Here

    Let cum[] denote cumulative array which can be calculated as follow:

    cum[0] = n - p[0];
    for i = 1 to n do:
        cum[i] = cum[i-1] + (n - p[i] - lcp[i])
    

    Now for finding i'th sub string just find the lower bound of i in cumulative array cum[] that will give you rank of suffix from where your sub string should start and print all character till length of

    i - cum[pos-1] + lcp[pos] // i lies between cum[pos-1] and cum[pos] so for finding 
                              // length of sub string starting from cum[pos-1] we should 
                              // subtract cum[pos-1] from i and add lcp[pos] as it is 
                              // common string between current rank suffix and 
                              // previous rank suffix. 
    

    where pos is value return by lower bound.

    Whole above process can be summarized as follow:

    string ithSubstring(int i){
        pos = lower_bound(cum , cum + n , i);
        return S.substr(arr[pos] , i - cum[pos-1] + lcp[pos]);// considering S as original character string 
    }
    

    For full implementation of Suffix array , LCP and above logic you can see Here