Creating a Non-greedy LZW algorithm

Basically, I'm doing an IB Extended Essay for Computer Science, and was thinking of using a non-greedy implementation of the LZW algorithm. I found the following links:

And have been operating under the assumption that the algorithm described in paper 1 and the LZW-FP in paper 2 are essentially the same. Either way, tracing the pseudocode in paper 1 has been a painful experience that has yielded nothing, and in the words of my teacher "is incredibly difficult to understand." If anyone can figure out how to trace it, or happens to have studied the algorithm before and knows how it works, that'd be a great help.

Solution

Note: I refer to what you call "paper 1" as Horspool 1995 and "paper 2" as Matias et al 1998. I only looked at the LZW algorithm in Horspool 1995, so if you were referring to the LZSS algorithm this won't help you much.

My understanding is that Horspool's algorithm is what the authors of Matias et al 1998 call "LZW-FPA", which is different from what they call "LZW-FP"; the difference has to do with the way the algorithm decides which substrings to add to the dictionary. Since "LZW-FP" adds exactly the same substrings to the dictionary as LZW would add, LZW-FP cannot produce a longer compressed sequence for any string. LZW-FPA (and Horspool's algorithm) add the successor string of the greedy match at each output cycle. That's not the same substring (because the greedy match doesn't start at the same point as it would in LZW) and therefore it is theoretically possible that it will produce a longer compressed sequence than LZW.

Horspool's algorithm is actually quite simple, but it suffers from the fact that there are several silly errors in the provided pseudo-code. Implementing the algorithm is a good way of detecting and fixing these errors; I put an annotated version of the pseudocode below.

LZW-like algorithms decompose the input into a sequence of blocks. The compressor maintains a dictionary of available blocks (with associated codewords). Initially, the dictionary contains all single-character strings. It then steps through the input, at each point finding the longest prefix at that point which is in its dictionary. Having found that block, it outputs its codeword, and adds to the dictionary the block with the next input character appended. (Since the block found was the longest prefix in the dictionary, the block plus the next character cannot be in the dictionary.) It then advances over the block, and continues at the next input point (which is just before the last character of the block it just added to the dictionary).

Horspool's modification also finds the longest prefix at each point, and also adds that prefix extended by one character into the dictionary. But it does not immediately output that block. Instead, it considers prefixes of the greedy match, and for each one works out what the next greedy match would be. That gives it a candidate extent of two blocks; it chooses the extent with the best advance. In order to avoid using up too much time in this search, the algorithm is parameterised by the number of prefixes it will test, on the assumption that much shorter prefixes are unlikely to yield longer extents. (And Horspool provides some evidence for this heuristic, although you might want to verify that with your own experimentation.)

In Horspool's pseudocode, α is what I call the "candidate match" -- that is, the greedy match found at the previous step -- and β_j is the greedy successor match for the input point after the j^th prefix of α. (Counting from the end, so β₀ is precisely the greedy successor match of α, with the result that setting K to 0 will yield the LZW algorithm. I think Horspool mentions this fact somewhere.) L is just the length of α. The algorithm will end up using some prefix of α, possibly (usually) all of it.

Here's Horspool's pseudocode from Figure 2 with my annotations:

initialize dictionary D with all strings of length 1;
set α = the string in D that matches the first
        symbol of the input;
set L = length(α);
while more than L symbols of input remain do
begin
    // The new string α++head(β0) must be added to D here, rather
    // than where Horspool adds it. Otherwise, it is not available for the
    // search for a successor match. Of course, head(β0) is not meaningful here
    // because β0 doesn't exist yet, but it's just the symbol following α in
    // the input.
    for j := 0 to max(L-1,K) do
        // The above should be min(L - 1, K), not max.
        // (Otherwise, K would be almost irrelevant.)
        find βj, the longest string in D that matches
            the input starting L-j symbols ahead;
    add the new string α++head(β0) to D;
    // See above; the new string must be added before the search
    set j = value of j in range 0 to max(L-1,K)
            such that L - j + length(βj) is a maximum;
    // Again, min rather than max
    output the index in D of the string prefix(α,j);
    // Here Horspool forgets that j is the number of characters removed
    // from the end of α, not the number of characters in the desired prefix.
    // So j should be replaced with L - j
    advance j symbols through the input;
    // Again, the advance should be L - j, not j
    set α = βj;
    set L = length(α);
end;
output the index in D of string α;