I learned this theory in class and decided to implement everything to make sure I understood it all; but the theoretical results don't align with the memory and time usages I obtain. I'm pretty sure this is not a shortcoming of theoretical algorithms but rather a mistake on my end and would like to investigate it.
The naive approach to constructing a suffix array is to generate all suffixes of the input string and then sort them lexicographically to obtain the final suffix array.
def naive_suffix_array(text: str) -> list[int]:
n = len(text)
suffixes = [(text[i:], i) for i in range(n)]
suffixes.sort()
suffix_array = [index for _, index in suffixes]
return suffix_array
This algorithm has a time complexity of O(n^2 * log n)
since we are generating all n
suffixes of the input string (O(n))
and then sorting them, which takes O(n log n) * O(n)
time (because ordering two strings of size n
takes O(n)
). Space complexity is O(n^2)
.
The efficient algorithm uses a "doubling" technique that sorts the suffixes based on the first k
characters, where k
is doubled in each step until k >= n
. This approach reduces the time complexity to O(n * log(n)^2)
. Space complexity is O(n)
.
def efficient_suffix_array(text: str) -> list[int]:
text = text + "$" # add a special character to the end of the string
n = len(text)
k = 1
# Initialize rank array with character values
rank = [ord(c) for c in text]
while k < n:
# Sort the suffixes based on the first k characters (suffixes is a list of indices)
suffixes = sorted(
range(n),
key=lambda i: (rank[i], rank[(i + k) % n])
)
# Update the rank array with new rank values
new_rank = [0] * n
rank_value = 0 # goes up by one when the rank changes
prev_suffix = suffixes[0] # the best suffix
new_rank[prev_suffix] = rank_value # the best suffix gets rank = 0
for i in range(1, n):
cur_suffix = suffixes[i]
if rank[cur_suffix] != rank[prev_suffix] or rank[(cur_suffix + k) % n] != rank[(prev_suffix + k) % n]:
# if rank of first k characters are different, or they are the same but rank of next k characters are different
rank_value += 1
new_rank[cur_suffix] = rank_value
prev_suffix = cur_suffix
if rank_value == n - 1:
# if all suffixes are in different equivalence classes, we are done, we don't need to look at the next characters,
# the prefixes are already unique
break
rank = new_rank
k *= 2
return suffixes[1:] # the first suffix is always the $, so we don't need it
Both algorithms work (on small unit tests that I wrote manually AND on large inputs they return the same results).
However, as you can see here, for an input of size 1000, nor the time complexity neither the space complexity of the efficient suffix array function are better:
I generate my large input with the following command:
import random
text: str = "".join(random.choice("abn") for _ in range(1000))