Search code examples
javaalgorithmdynamic-programmingsuffix-array

Minimum number of distinct characters to result in a given suffix array


Given a suffix array, a TopCoder task from SRM 630 asks to find the minium number of distinct characters in the string that could form a string with the given suffix array. The full problem statement can be found on the TopCoder website.

The best solution I found is right here: https://github.com/ftiasch/acm-icpc/blob/6db1ed02a727611830b974a1d4de38bab8f390f9/topcoder/single-round-match/single-round-match-630/SuffixArrayDiv1.java

Here is the algorithm written by ftiasch:

public int minimalCharacters(int[] array) {
    int n = array.length;
    int[] position = new int[n + 1];
    for (int i = 0; i < n; ++i) {
        position[array[i]] = i;
    }
    position[n] = -1;
    int[] minimum = new int[n + 1];
    for (int i = n - 1; i >= 0; --i) {
        minimum[i] = Integer.MAX_VALUE;
        for (int j = i + 1; j <= n; ++j) {
            boolean valid = true;
            for (int x = i; x < j; ++x) {
                for (int y = x + 1; y < j; ++y) {
                    valid &= position[array[x] + 1] < position[array[y] + 1];
                }
            }
            if (valid && minimum[j] < Integer.MAX_VALUE) {
                minimum[i] = Math.min(minimum[i], minimum[j] + 1);
            }
        }
    }
    return minimum[0];
}

I understand that this is a dynamic programming algorithm but how does it work? I really need a hand understanding it.

EDIT

Here is what ftiasch wrote me back:

hi Ariel,

First of all, thanks to your compliment. Frankly speaking, my solution is not the best solution to the problem. The optimal one runs in O(n) time but mine takes O(n^4). I just picked this idea during the contest because n is relatively small.

Keep in mind that same characters become continuous in the SA. Since the problem asked for the least number of characters, so I decided to use dynamic programming to partition the SA into consecutive segments so that each segments start with the same character.

Which condition is necessary for S[SA[i]] == S[SA[j]] assumed that i < j? The lexicographic comparison suggests that suffix(SA[i] + 1) should be smaller than suffix(SA[j] + 1). We can easily find that the condition is also sufficient.

Write to me if you have any other question. :)

EDIT1

We finally managed to make it work, thanks to David. Here is the linear time algorithm in java from David's Python version:

public int minimalCharacters(int[] array) {
    int n = array.length, i;
    if (n == 0)
        return 0;
    int[] array1 = new int[n + 1];
    for (i = 0; i < n; i++)
        array1[1 + i] = array[i];
    int[] position = new int[n + 1];
    for (i = 0; i < n + 1; i++)
        position[array1[i]] = i;
    int k = 1;
    for (i = n; i > 1; i--) {
        if (position[array1[i] + 1] <= position[array1[i - 1] + 1])
            k++;
    }
    return k;
}

Solution

  • Here’s a quadratic-time algorithm. The suffix array specifies for each pair of suffixes how they compare lexicographically (and the empty suffix always is less than all of them). Let s be the unknown string and suppose that we’re comparing suffix s[i...] with suffix s[j...]. If s[i] != s[j], then the comparison of s[i] and s[j] settles it. Otherwise, the result is the same as if we compare s[i+1...] and s[j+1...].

    Suppose that we wish to ensure that s[i...] < s[j...]. Clearly we need s[i] <= s[j]. In fact, unless s[i+1...] < s[j+1...], we need the strict inequality s[i] < s[j], as otherwise the tiebreaker will go the wrong way. Otherwise, s[i] == s[j] will suffice regardless of the rest of the string. Gather up all of the inequalities as arcs in a directed graph with vertices corresponding to positions in s. This graph is necessarily acyclic by the total order on suffixes. Make each arc length 1 if the inequality is strict and length 0 otherwise. Output the length of the longest path, plus one (or zero if the graph is empty).

    At least this many distinct letters are needed, by the corresponding chain of inequalities. What’s perhaps less clear is that this many distinct letters suffices, but if we determine the label of each vertex/position in s by the length of the longest path starting there, then the head and tail of each arc are labeled appropriately.

    To get down to linear time, we can exploit the structure of the graph. It’s straightforward (though not trivial; the graph is metric after all) to show that the path visiting all vertices of the graph is the longest, so we merely have to compute its length.

    Below are a transliterated version of the sample code (minChars1), an implementation straight from the description above (minChars2, now stripped of all comprehension usage), a brute force solution (minChars3), and the linear-time solution (minChars4). import itertools

    def minChars1(array):
        n = len(array)
        position = [-1] * (n + 1)
        for i in range(n):
            position[array[i]] = i
        infinity = n + 1
        minimum = [infinity] * (n + 1)
        minimum[n] = 0
        for i in range(n - 1, -1, -1):
            for j in range(i + 1, n + 1):
                valid = True
                for x in range(i, j):
                    for y in range(x + 1, j):
                        valid = valid and position[array[x] + 1] < position[array[y] + 1]
                if valid and minimum[j] < infinity:
                    minimum[i] = min(minimum[i], minimum[j] + 1)
        return minimum[0]
    
    
    def lengthOfLongestPath(graph, memo, i):
        if i not in memo:
            result = 0
            for w, j in graph[i]:
                result = max(result, w + lengthOfLongestPath(graph, memo, j))
            memo[i] = result
        return memo[i]
    
    
    def minChars2(array):
        n = len(array)
        position = [-1] * (n + 1)
        for i in range(n):
            position[array[i]] = i
        graph = {}
        for i in range(n):
            graph[i] = []
            for j in range(n):
                if position[i] > position[j]:
                    w = 0 if position[i + 1] > position[j + 1] else 1
                    graph[i].append((w, j))
        memo = {None: -1}
        for i in range(n):
            lengthOfLongestPath(graph, memo, i)
        return max(memo.values()) + 1
    
    
    def minChars3(array):
        n = len(array)
        position = [None] * n
        for i in range(n):
            position[array[i]] = i
        for k in range(n):
            for s in itertools.product(range(k), repeat=n):
                valid = True
                for i in range(n):
                    for j in range(n):
                        valid = valid and (s[i:] < s[j:]) == (position[i] < position[j])
                if valid:
                    return k
        return n
    
    
    def minChars4(array):
        n = len(array)
        if n == 0:
            return 0
        array1 = [n] * (n + 1)
        for i in range(n):
            array1[1 + i] = array[i]
        position = [None] * (n + 1)
        for i in range(n + 1):
            position[array1[i]] = i
        k = 1
        for i in range(n, 1, -1):
            if position[array1[i] + 1] <= position[array1[i - 1] + 1]:
                k += 1
        return k
    
    
    def test():
        for n in range(7):
            for array in itertools.permutations(range(n)):
                assert minChars1(array) == minChars2(array) == minChars3(array) == minChars4(array)
    
    
    test()