I've coded a Suffix Array implementation and discovered an issue in my implementation. Concretely I've outputted the first few suffix array ranks RA[0..7]
of this string(length = 10^5) and had the following output:
80994
84360
87854
91517
95320
99277
83068
But the correct one had to be (everything shifted by 23):
81017
84383
87877
91540
95343
99300
83091
I know two ways how to fix it, but I don't know why it worked.
The first way was adding S[N++] = '$';
to the top of the buildSA()
function (then the output was 1 less than the correct one, but it doesn't matter)
I also found another solution by decreasing the MAX_N constant to 1e5 + 10!
This is so much magic for me and I really need to know why this bug happened because I don't want to have this bug again.
#include <cstdio>
#include <cstring>
#include <algorithm>
using std::max;
const int MAX_N = 2e5 + 10;
int SA[MAX_N]; // The ith element is the index of the suffix
int RA[MAX_N]; // The rank of the suffix at i
int tmp[MAX_N]; // A temporary array
int B[MAX_N]; // An array for the buckets
int N;
char S[MAX_N];
void bucketSort(int k){
int i, m = max(256, N);
for(i = 0; i < m; i++)
B[i] = 0;
for(i = 0; i < N; i++)
B[i + k < N ? RA[i + k] : 0] ++;
for(i = 1; i < m; i++)
B[i] += B[i - 1];
for(i = N - 1; i >= 0; i--)
tmp[--B[SA[i] + k < N ? RA[SA[i] + k] : 0]] = SA[i];
for(i = 0; i < N; i++)
SA[i] = tmp[i];
}
void buildSA(){
for(int i = 0; i < N; i++){
SA[i] = i;
RA[i] = S[i];
}
for(int k = 1; k < N; k <<= 1){
bucketSort(k);
bucketSort(0);
int norder = 0;
tmp[SA[0]] = 0;
for(int i = 1; i < N; i++){
if(RA[SA[i]] == RA[SA[i - 1]] && RA[SA[i] + k] == RA[SA[i - 1] + k])
{} else norder++;
tmp[SA[i]] = norder;
}
for(int i = 0; i < N; i++)
RA[i] = tmp[i];
if(norder == N)
break;
}
}
void printSA(){
for(int i = 0; i < N; i++){
printf("%d: %s\n", SA[i], S + SA[i]);
}
}
int main(){
scanf("%s", S);
N = strlen(S);
buildSA();
for(int i = 0; i < 7; i++){
printf("%d\n",RA[i]);
}
return 0;
}
In the following line:
if(RA[SA[i]] == RA[SA[i - 1]] && RA[SA[i] + k] == RA[SA[i - 1] + k])
SA[i] + k
can be >=N
(the same is for SA[i - 1] + k
).
It should be (SA[i] + k) % N
instead.