Search code examples
algorithmbig-olossless-compression

Optimal runtime for simple compression scheme


Here is a simple puzzle with some applications in bio-informatics. It's an abstracted version of something that came up at a friend's job.

Consider a very simple compression scheme, whose output consists of a stream of two operations:

  • put(a): outputs a character a
  • dup(): duplicates all output written so far

For notational convenience, write the character x itself for put('x') and * for dup().

For example, "a**b*c" expands to "aaaabaaaabc".

To compress a given string s, find the shortest list of these two operations to generate it.

For example for "aaaaaaaaaab" shortens to a**a*b. (a***aab is also possible, but one character longer.)

My question is: what's the best achievable runtime for optimal compression? (And what's the algorithm that achieves that runtime.)

I believe linear runtime is possible, but I haven't found anything definitely better than quadratic, yet. (Not too worried about using extra space.)


Solution

  • Yes, a linear run time is possible for this compression scheme.

    Create a list dp. The ith element of this list will the best compression possible for the first i elements of the string.

    dp[1] = 1
    dp[i] = dp[i-1] + 1
    if i is even and first i/2 elements of string are equal to second i/2 elements:
        dp[i] = min(dp[i], dp[i/2] + 1)
    

    To check if first i/2 elements are equal to second i/2 elements, you can find the longest common prefix between the string and the suffix starting at index i/2. If this prefix is greater than or equal to i/2 in length, then the first i/2 elements are indeed equal to the second i/2 elements.

    Speeding up this operation is possible using a modified LCP array.

    First, build a suffix array for the string in O(n).

    Then, build a longest common prefix array for the suffix array in O(n).

    Now, find the index of the complete string in the suffix array. Let's say it's i. Now, iterate from i to end of LCP array replacing each value with the minimum seen so far. Similarly, iterate down from i-1 to 0 in the LCP array replacing each value with the minimum seen so far.

    Once this is done, each value in the LCP array represents the longest common prefix of that suffix with the complete string, which is what the algorithm required. Note that these values are ordered according to the sorted suffixes rather than the position of the suffixes in the string. Mapping it is fairly simple using the suffix array though.