Search code examples
pythonalgorithmsuffix-array

Suffix Array, n^2*log(n) faster than n*log^2(n) even for large inputs?


I learned this theory in class and decided to implement everything to make sure I understood it all; but the theoretical results don't align with the memory and time usages I obtain. I'm pretty sure this is not a shortcoming of theoretical algorithms but rather a mistake on my end and would like to investigate it.

The naive approach to constructing a suffix array is to generate all suffixes of the input string and then sort them lexicographically to obtain the final suffix array.

def naive_suffix_array(text: str) -> list[int]:
    n = len(text)
    suffixes = [(text[i:], i) for i in range(n)]
    suffixes.sort()
    suffix_array = [index for _, index in suffixes]
    return suffix_array

This algorithm has a time complexity of O(n^2 * log n) since we are generating all n suffixes of the input string (O(n)) and then sorting them, which takes O(n log n) * O(n) time (because ordering two strings of size n takes O(n)). Space complexity is O(n^2).

The efficient algorithm uses a "doubling" technique that sorts the suffixes based on the first k characters, where k is doubled in each step until k >= n. This approach reduces the time complexity to O(n * log(n)^2). Space complexity is O(n).

def efficient_suffix_array(text: str) -> list[int]:
    text = text + "$"  # add a special character to the end of the string
    
    n = len(text)
    k = 1

    # Initialize rank array with character values
    rank = [ord(c) for c in text]

    while k < n:
        # Sort the suffixes based on the first k characters (suffixes is a list of indices)
        suffixes = sorted(
            range(n),
            key=lambda i: (rank[i], rank[(i + k) % n])
        )
        
        # Update the rank array with new rank values
        new_rank = [0] * n
        rank_value = 0  # goes up by one when the rank changes
        prev_suffix = suffixes[0]  # the best suffix
        new_rank[prev_suffix] = rank_value  # the best suffix gets rank = 0

        for i in range(1, n):
            cur_suffix = suffixes[i]  
            if rank[cur_suffix] != rank[prev_suffix] or rank[(cur_suffix + k) % n] != rank[(prev_suffix + k) % n]:
                # if rank of first k characters are different, or they are the same but rank of next k characters are different
                rank_value += 1
            new_rank[cur_suffix] = rank_value
            prev_suffix = cur_suffix
        
        if rank_value == n - 1:
            # if all suffixes are in different equivalence classes, we are done, we don't need to look at the next characters,
            # the prefixes are already unique
            break

        rank = new_rank
        k *= 2

    return suffixes[1:]  # the first suffix is always the $, so we don't need it

Both algorithms work (on small unit tests that I wrote manually AND on large inputs they return the same results).

However, as you can see here, for an input of size 1000, nor the time complexity neither the space complexity of the efficient suffix array function are better: enter image description here

I generate my large input with the following command:

import random
text: str = "".join(random.choice("abn") for _ in range(1000))

Solution

  • I actually realised my "large inputs" were just not "large" enough.

    For super duper large inputs:

    enter image description here