This question may not be strictly limited to the LZW algorithm, and may cover other implementations of LZ77 and LZ78:
I've been trying to write a compression/decompression utility involving the LZW dictionary coding scheme. The problem is that I've found it necessary to include a delimiting character (a space) after each codeword (or "code") is written out during the encoding phase. I've been doing this because I can't assume that the output is streamed directly to the decoder; it could be stored in a compressed file to be decoded later, in which case the decoder will need some way to detect where one codeword ends and the next begins (the delimiter).
I've recently been told that this is unnecessary, and that the decoder should be able to "figure out" dynamically how much of the compressed file to read in each time, presumably based on the previously read code(s). This would supposedly eliminate the (costly) need to insert an extra byte after each code.
I'm just not sure how the decoder can figure this out. Maybe someone that understands how it works could explain it to me? Thanks.
EDIT:
The dictionary is a hash table mapping "input strings" to integers (codes), and is built up in the usual way as more of the input data is read in. The codes are written out to the compressed file. The decoder reads in each code (integer) from the compressed file and either looks up its dictionary for the associated string to output, or if there's no entry for that code, then it figures out what the string should be in the usual way and updates its dictionary.
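To make the scheme concrete, here is a minimal in-memory sketch of what I mean (plain Python, function names are just illustrative, and the question of how the codes are laid out in the file is left aside):

```python
def lzw_encode(data: bytes) -> list[int]:
    # Start with the 256 single-byte strings at codes 0-255.
    table = {bytes([i]): i for i in range(256)}
    next_code = 256
    codes = []
    current = b""
    for b in data:
        candidate = current + bytes([b])
        if candidate in table:
            current = candidate          # keep extending the match
        else:
            codes.append(table[current]) # emit code for longest match
            table[candidate] = next_code # add the new string
            next_code += 1
            current = bytes([b])
    if current:
        codes.append(table[current])
    return codes

def lzw_decode(codes: list[int]) -> bytes:
    table = {i: bytes([i]) for i in range(256)}
    next_code = 256
    prev = table[codes[0]]
    out = [prev]
    for code in codes[1:]:
        if code in table:
            entry = table[code]
        else:
            # Code not yet in the table: reconstruct it "in the usual way"
            entry = prev + prev[:1]
        out.append(entry)
        table[next_code] = prev + entry[:1]
        next_code += 1
        prev = entry
    return b"".join(out)
```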
"Why does it make a difference if the file is streamed or stored?" If the output of the encoder is streamed one code at a time to the decoder, then the decoder can work with each code as it receives them. But if the encoder writes all the codes to a file (the compressed file), and then later that file is fed into the decoder, then how does the decoder know where one code starts and another one begins. The file would just be a mashed up sequence of digits.
For example: Delimited Compressed file: 127 32 45 22 228 122 209.... Non-Delimited Compressed file: 127324522228122209...
-Rob
In LZW the dictionary is not stored with the compressed file. (Or the dictionary is the file, depending on your perspective.) Each value that is written to the file is of a predefined bit width according to its location. For example, it can start off as pairs of 9-bit dictionary indexes followed by 8-bit data, until the indexes are exhausted (which happens at a precise location), when it switches to 10-bit indexes.
The details depend on how you implement the compression. Some implementations use constant 12-bit indexes. But in no case are extra delimiters required.
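To illustrate how the decoder "figures out" code boundaries without delimiters: it just reads a known number of bits per code. A minimal bit-reader sketch (MSB-first bit order is an assumption here; real formats pick one and stick to it):

```python
class BitReader:
    """Read integers of arbitrary bit width from a byte string, MSB-first."""

    def __init__(self, data: bytes):
        self.data = data
        self.pos = 0  # current position, in bits

    def read(self, width: int) -> int:
        value = 0
        for _ in range(width):
            byte = self.data[self.pos // 8]
            bit = (byte >> (7 - self.pos % 8)) & 1
            value = (value << 1) | bit
            self.pos += 1
        return value

# The decoder starts at 9-bit codes and widens to 10, 11, ... exactly
# when its own table crosses the next power-of-two size.  The encoder's
# table grows in lockstep, so both sides agree on the width at every
# point in the stream -- no delimiters needed.
```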
Also, since the data is not aligned on 8-bit boundaries, there isn't a way to detect delimiters if you are not already reading the data correctly!
Edit:
If you are hoping for the LZW compression algorithm to actually generate smaller data than the input, then there are several things that you should do.
First, you must write the file as binary rather than text. Writing it as text will expand rather than contract the size of the file. The value 127 can be stored in a single byte in binary (01111111), but requires four bytes of UTF-8 with a delimiting space ("127 " = 00110001 00110010 00110111 00100000).
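You can see the expansion directly (Python used here just for the byte counts):

```python
code = 127

# Text output with a delimiting space: four bytes on disk.
as_text = f"{code} ".encode("utf-8")   # b"127 "

# Binary output: the same value in a single byte.
as_binary = bytes([code])              # b"\x7f"
```

Four bytes of output per code versus one is a 4x expansion before the compression has even had a chance to help.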
Second, LZW is designed to work with code values that are larger than one byte but less than two bytes, so you must do some bit-twiddling to output the data correctly. A single byte is only enough to encode the first 256 implicitly defined table entries. Another bit will give you another 256 entries to work with, but the entries in a 9-bit indexed table are exhausted quickly. With 12 bits, you can get 4096 table entries, which is a reasonable table size. If you were to use two full bytes, then you would have a rather large table with 65,536 entries. The trouble with this is that if you are not using the full capacity of your table space, then you are wasting bits. The output would have lots of bits in it that are always zero, and this will be really bad for your compression ratio.
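A sketch of the bit-twiddling for the constant 12-bit case (helper names are mine; this packs two 12-bit codes into every three bytes, MSB-first):

```python
def pack12(codes):
    """Pack 12-bit codes into bytes: two codes fill exactly three bytes."""
    bits = 0      # bit accumulator
    nbits = 0     # how many bits are pending in the accumulator
    out = bytearray()
    for c in codes:
        bits = (bits << 12) | c
        nbits += 12
        while nbits >= 8:
            nbits -= 8
            out.append((bits >> nbits) & 0xFF)
    if nbits:  # flush a final partial byte, zero-padded on the right
        out.append((bits << (8 - nbits)) & 0xFF)
    return bytes(out)

def unpack12(data, count):
    """Read `count` 12-bit codes back out of the packed bytes."""
    bits = 0
    nbits = 0
    out = []
    for b in data:
        bits = (bits << 8) | b
        nbits += 8
        if nbits >= 12:
            nbits -= 12
            out.append((bits >> nbits) & 0xFFF)
            if len(out) == count:
                break
    return out
```

The decoder never needs a delimiter: it always consumes exactly 12 bits per code, so code boundaries are implied by position alone.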
Third, a streaming coder/decoder cannot process single values at a time, because the encoded data overlaps byte boundaries. If a constant 12-bit code size is used, then pairs of coded values (which align to three-byte boundaries) may be processed at one time. But in general the algorithm is designed to work with complete files.