I reading up the book Competitive Programming 3 by Steven Halim and Felix Halim
I'm reading the chapter on Strings.I'm trying to understand the suffix array construction algorithm. I dont understand the radix sort part. (Although, I understand how radix sort and counting sort works)
Here is the code from the book
#define MAX_N 100010 // second approach: O(n log n)
char T[MAX_N]; // the input string, up to 100K characters
int n; // the length of input string
int RA[MAX_N], tempRA[MAX_N]; // rank array and temporary rank array
int SA[MAX_N], tempSA[MAX_N]; // suffix array and temporary suffix array
int c[MAX_N]; // for counting/radix sort
void countingSort(int k) { // O(n)
int i, sum, maxi = max(300, n); // up to 255 ASCII chars or length of n
memset(c, 0, sizeof c); // clear frequency table
for (i = 0; i < n; i++){ // count the frequency of each integer rank
c[i + k < n ? RA[i + k] : 0]++;
}
for (i = sum = 0; i < maxi; i++) {
int t = c[i]; c[i] = sum; sum += t;
}
for (i = 0; i < n; i++){ // shuffle the suffix array if necessary
tempSA[c[SA[i]+k < n ? RA[SA[i]+k] : 0]++] = SA[i];
}
for (i = 0; i < n; i++){ // update the suffix array SA
SA[i] = tempSA[i];
}
}
void constructSA() { // this version can go up to 100000 characters
int i, k, r;
for (i = 0; i < n; i++) RA[i] = T[i]; // initial rankings
for (i = 0; i < n; i++) SA[i] = i; //initial SA: {0, 1, 2, ..., n-1}
for (k = 1; k < n; k <<= 1) { // repeat sorting process log n times
countingSort(k); //actually radix sort:sort based on the second item
countingSort(0); // then (stable) sort based on the first item
tempRA[SA[0]] = r = 0; // re-ranking; start from rank r = 0
// compare adjacent suffixes
for (i = 1; i < n; i++){
// if same pair => same rank r; otherwise,increase r
tempRA[SA[i]] = (RA[SA[i]] == RA[SA[i-1]] && RA[SA[i]+k] == RA[SA[i-1]+k]) ? r : ++r;
}
for (i = 0; i < n; i++){// update the rank array RA
RA[i] = tempRA[i];
}
if (RA[SA[n-1]] == n-1) break; // nice optimization trick
}
}
Can somebody please explain what is happening the in these lines of the countingSort() function?
for (i = sum = 0; i < maxi; i++) {
int t = c[i]; c[i] = sum; sum += t;
}
for (i = 0; i < n; i++){ // shuffle the suffix array if necessary
tempSA[c[SA[i]+k < n ? RA[SA[i]+k] : 0]++] = SA[i];
}
for (i = 0; i < n; i++){ // update the suffix array SA
SA[i] = tempSA[i];
}
Thanks a lot for your valuable time.
First compute the startIndex for each unique ranking.
REMARK: c[]
here stands for a ranking and not just for an individual character.
// compute cumulates of rankings
for (i = sum = 0; i < maxi; i++) {
int t = c[i]; c[i] = sum; sum += t;
}
Reorder the Suffix array using the just computed startIndices. Based on the ranking of the SA[i]+k
suffix.
// shuffle the suffix array if necessary
for (i = 0; i < n; i++){
tempSA[c[SA[i]+k < n ? RA[SA[i]+k] : 0]++] = SA[i];
}
Copy back the updated values from the temp array
// copy the updated values back to SA
for (i = 0; i < n; i++){
SA[i] = tempSA[i];
}
This means that the suffix starting at position i
is sorted by the ranking of the the suffix at place (i+k)
.
We sort each suffix of length k
, by the suffix of length k
at place i+k
. We can do this because at the previous iteration all suffixes were sorted for length k
.
After that we sort again from the first index. Which was holding the ranking for size k
. Since the sorting is stable, all suffixes are now sorted for length k*2
.
Our next step is to update the ranking if two contiguous suffix arrays in the ranking are not equal anymore.
for (i = 1; i < n; i++){
// if same pair => same rank r; otherwise,increase r
tempRA[SA[i]] = (RA[SA[i]] == RA[SA[i-1]] && RA[SA[i]+k] == RA[SA[i-1]+k]) ? r : ++r;
}
If the ranking for size k
at their startIndex
is the same and the ranking at their startIndex+k
is the same. Then the ranking at the startIndex
is the same for size k*2
.
This should also explain the following:
if (RA[SA[n-1]] == n-1) break; // nice optimization trick
This means that at that point the rankings for the current size are all unique. So all suffixes are unique as well and no further sorting is required.
a b c x a b c d
--------------------------------INIT-
0 1 2 3 4 5 6 7 // SA
97 98 99 120 97 98 99 100 // RA
---------------------------------K=1-
0 2 5 7 1 3 4 6 // SA
0 1 2 4 0 1 2 3 // RA
---------------------------------K=2-
1 3 5 7 0 2 4 6 // SA
1 3 5 7 0 2 4 6 // RA
// count frequencies
c['a']=2;
c['b']=2;
c['c']=2;
c['d']=1;
c['x']=1;
// switch them to startindices
c['a']=0;
c['b']=2;
c['c']=4;
c['d']=6; // e.g. in total there are 6 suffixes smaller than starting with d (2 x a, 2 x b, 2 x c)
c['x']=7;
// determine the new SA position
tempSA[c[rank(SA[i]+k)]++] = SA[i];
// decomposing first iteration
tempSA[c[rank(SA[0]+k)]++] = SA[0]; // i = 0
tempSA[c[rank(SA[0]+1)]++] = SA[0]; // k = 1
tempSA[c[rank(1)]++] = 0; // SA[0] = 0
tempSA[c['b']++] = 0; // rank(1) = 'B'
tempSA[2] = 0; // c['b']=2 => 2++ = 3
In other words: put current first suffix array at the startIndex of the suffixArray that starts k places after. And increase that startIndex by one so the next occurence won't override.
// all other iterations resulting in:
tempSA[0] = 7 // d (sorted by EMPTY)
tempSA[1] = 3 // x (sorted by a)
tempSA[2] = 0 // a (sorted by b)
tempSA[3] = 4 // a (sorted by b)
tempSA[4] = 1 // b (sorted by c)
tempSA[5] = 5 // b (sorted by c)
tempSA[6] = 6 // c (sorted by d)
tempSA[7] = 2 // c (sorted by d)
// last step is simply copying those values to SA (I suppose you know why this is)
This is all I can give you, if you still have troubles try to go through it with the debugger or print out subresults where you have doubts.