Search code examples
c++stringalgorithmhashsuffix-array

How can I find the occurence number of each suffix in a string?


I want to find how many times each suffix of a string occurs in the original string in O(nlogn) or O(n) time.

For example, for string aba, suffix a appears twice, ba appears once, aba appears once.


Solution

  • Suffix Array Solution

    Construct suffix tree of the string S along with LCP array. This will help in counting all the occurrences of each suffix.

    without learning what suffix array and LCP are, its difficult to understand.

    suffix array

    LCP

    kasai’s Algorithm for Construction of LCP array from Suffix Array

    Let us take an example string and create its suffix array. Consider the string S = "ABABBAABB".

    suffix positions(pos)   Suffixes of S   LCP array of S
        5                   AABB            1
        0                   ABABBAABB       2
        6                   ABB             3
        2                   ABBAABB         0
        8                   B               1
        4                   BAABB           2
        1                   BABBAABB        1
        3                   BBAABB          2
        7                   BB              not Defined
    

    First column(pos array) is the original starting points of sorted suffixes in Suffix Array. Let call second column as SuffixArray (we do not need to compute it, its just for visualization).

    Now, as we know LCP[i]= the length of longest common prefix between SuffixArray[i] and SuffixArray[i+1]. e.g. LCP1=lcp("ABABBAABB","ABB")=2.

    Let Count[i] = number of occurrences of suffix starting at position i.

    for (int i = 0; i < n; )
    {
        int j=i;
        while(LCP[j]==n-pos[j]){ // loop if SuffixArray[j] is a prefix of SuffixArray[j+1] 
            j++;
        }
        int incr=1;
        for (int k = j-1; k>= i ; --k)
        {
            count[ pos[k] ] = incr;
            incr++;
        } 
        i=j+1;
    }
    

    This is highly optimized solution and if you look closely towards all steps, Complexity is O(n log n).

    Hope it helps. Please go through everything again if you do not understand in the first try.



    EDIT: There is tiny bug in this computation of count array. Basically my problem is to find the immediate next index in the LCP array which is smaller than current value. I am providing the correct implementation .

    stack< int > stack;
    
    count[ pos[n-1] ] = 1;
    
    for(int i=n-2;i>=0;i--){
        while(!stack.empty() and LCP[stack.top()]>=LCS[i]){
            stack.pop();
        }
    
        if( LCP[i] == n-pos[i]  ){
            if (stack.empty())
            {
                count[ pos[i] ] = n-i ;
            }else{
                count[ pos[i] ] = stack.top()-i ;
            }
    
        }else{
            count[ pos[i] ] = 1;
        }
    
        stack.push(i);
    
    }
    

    next smaller element in array