Search code examples
calgorithmsuffix-arraycounting-sort

Suffix Array Construction O(N LogN) - Competitive Programming 3 Steven Halim


I reading up the book Competitive Programming 3 by Steven Halim and Felix Halim

I'm reading the chapter on Strings.I'm trying to understand the suffix array construction algorithm. I dont understand the radix sort part. (Although, I understand how radix sort and counting sort works)

Here is the code from the book

#define MAX_N 100010 // second approach: O(n log n)
char T[MAX_N]; // the input string, up to 100K characters
int n; // the length of input string

int RA[MAX_N], tempRA[MAX_N]; // rank array and temporary rank array
int SA[MAX_N], tempSA[MAX_N]; // suffix array and temporary suffix array

int c[MAX_N]; // for counting/radix sort

void countingSort(int k) { // O(n)

    int i, sum, maxi = max(300, n); // up to 255 ASCII chars or length of n
    memset(c, 0, sizeof c); // clear frequency table

    for (i = 0; i < n; i++){ // count the frequency of each integer rank
        c[i + k < n ? RA[i + k] : 0]++;
    }
    for (i = sum = 0; i < maxi; i++) {
        int t = c[i]; c[i] = sum; sum += t; 
    }
    for (i = 0; i < n; i++){ // shuffle the suffix array if necessary
        tempSA[c[SA[i]+k < n ? RA[SA[i]+k] : 0]++] = SA[i];
    }
    for (i = 0; i < n; i++){ // update the suffix array SA
        SA[i] = tempSA[i];
    }
}

void constructSA() { // this version can go up to 100000 characters
    int i, k, r;
    for (i = 0; i < n; i++) RA[i] = T[i]; // initial rankings
    for (i = 0; i < n; i++) SA[i] = i; //initial SA: {0, 1, 2, ..., n-1}

    for (k = 1; k < n; k <<= 1) { // repeat sorting process log n times
        countingSort(k); //actually radix sort:sort based on the second item
        countingSort(0); // then (stable) sort based on the first item

        tempRA[SA[0]] = r = 0; // re-ranking; start from rank r = 0

        // compare adjacent suffixes
        for (i = 1; i < n; i++){
            // if same pair => same rank r; otherwise,increase r
            tempRA[SA[i]] = (RA[SA[i]] == RA[SA[i-1]] && RA[SA[i]+k] == RA[SA[i-1]+k]) ? r : ++r;           
        }

        for (i = 0; i < n; i++){// update the rank array RA
            RA[i] = tempRA[i];
        }

        if (RA[SA[n-1]] == n-1) break; // nice optimization trick
    } 
}

Can somebody please explain what is happening the in these lines of the countingSort() function?

for (i = sum = 0; i < maxi; i++) {
    int t = c[i]; c[i] = sum; sum += t; 
}
for (i = 0; i < n; i++){ // shuffle the suffix array if necessary
    tempSA[c[SA[i]+k < n ? RA[SA[i]+k] : 0]++] = SA[i];
}
for (i = 0; i < n; i++){ // update the suffix array SA
    SA[i] = tempSA[i];
}

Thanks a lot for your valuable time.


Solution

  • First compute the startIndex for each unique ranking.

    REMARK: c[] here stands for a ranking and not just for an individual character.

    // compute cumulates of rankings
    for (i = sum = 0; i < maxi; i++) {
        int t = c[i]; c[i] = sum; sum += t; 
    }
    

    Reorder the Suffix array using the just computed startIndices. Based on the ranking of the SA[i]+k suffix.

    // shuffle the suffix array if necessary
    for (i = 0; i < n; i++){ 
        tempSA[c[SA[i]+k < n ? RA[SA[i]+k] : 0]++] = SA[i];
    }
    

    Copy back the updated values from the temp array

    // copy the updated values back to SA
    for (i = 0; i < n; i++){ 
        SA[i] = tempSA[i];
    }
    

    This means that the suffix starting at position i is sorted by the ranking of the the suffix at place (i+k).

    We sort each suffix of length k, by the suffix of length k at place i+k. We can do this because at the previous iteration all suffixes were sorted for length k.

    After that we sort again from the first index. Which was holding the ranking for size k. Since the sorting is stable, all suffixes are now sorted for length k*2.

    Our next step is to update the ranking if two contiguous suffix arrays in the ranking are not equal anymore.

    for (i = 1; i < n; i++){
        // if same pair => same rank r; otherwise,increase r
        tempRA[SA[i]] = (RA[SA[i]] == RA[SA[i-1]] && RA[SA[i]+k] == RA[SA[i-1]+k]) ? r : ++r;           
    }
    

    If the ranking for size k at their startIndex is the same and the ranking at their startIndex+k is the same. Then the ranking at the startIndex is the same for size k*2.

    This should also explain the following:

    if (RA[SA[n-1]] == n-1) break; // nice optimization trick
    

    This means that at that point the rankings for the current size are all unique. So all suffixes are unique as well and no further sorting is required.


    Stepped example:

      a   b   c   x   a   b   c   d 
    --------------------------------INIT-
      0   1   2   3   4   5   6   7 // SA
     97  98  99 120 97  98  99  100 // RA
    ---------------------------------K=1-
      0   2   5   7   1   3   4   6 // SA
      0   1   2   4   0   1   2   3 // RA
    ---------------------------------K=2-
      1   3   5   7   0   2   4   6 // SA
      1   3   5   7   0   2   4   6 // RA
    

    countintSort Example for step K=1:

    // count frequencies
    c['a']=2;
    c['b']=2;
    c['c']=2;
    c['d']=1;
    c['x']=1;
    
    // switch them to startindices
    c['a']=0;
    c['b']=2;
    c['c']=4;
    c['d']=6; // e.g. in total there are 6 suffixes smaller than starting with d (2 x a, 2 x b, 2 x c)
    c['x']=7;
    
    // determine the new SA position
    tempSA[c[rank(SA[i]+k)]++] = SA[i];
    // decomposing first iteration
    tempSA[c[rank(SA[0]+k)]++] = SA[0]; // i = 0
    tempSA[c[rank(SA[0]+1)]++] = SA[0]; // k = 1
    tempSA[c[rank(1)]++] = 0; // SA[0] = 0
    tempSA[c['b']++] = 0; // rank(1) = 'B'
    tempSA[2] = 0; // c['b']=2 => 2++ = 3
    

    In other words: put current first suffix array at the startIndex of the suffixArray that starts k places after. And increase that startIndex by one so the next occurence won't override.

    // all other iterations resulting in:
    tempSA[0] = 7 // d (sorted by EMPTY)
    tempSA[1] = 3 // x (sorted by a)
    tempSA[2] = 0 // a (sorted by b)
    tempSA[3] = 4 // a (sorted by b)
    tempSA[4] = 1 // b (sorted by c)
    tempSA[5] = 5 // b (sorted by c)
    tempSA[6] = 6 // c (sorted by d) 
    tempSA[7] = 2 // c (sorted by d)
    
    // last step is simply copying those values to SA (I suppose you know why this is)
    

    This is all I can give you, if you still have troubles try to go through it with the debugger or print out subresults where you have doubts.