c++ lzw

How to Detect Whether a String Is Compressed by the LZW Algorithm in C++


I have two XML files: one is compressed with LZW, the other is plain text. How can I tell whether a given file is compressed or not?


Solution

  • The obvious approach is, of course, to feed the string to an LZW decompressor and see whether it reports an error and/or whether the length of the string increases by approximately 200%.

    That aside, a (well-formed) LZW file, as written by the Unix compress utility (.Z format), starts with the magic value 0x1F 0x9D. It is of course possible to LZW-compress a string without including the magic value, but checking for it is a good start and very easy to do.

    A (well-formed) XML document should start with an XML declaration and, failing that, must start with an element, only optionally preceded by whitespace (a comment or DOCTYPE, both beginning with <!, may also come first). XML declarations start with the string <?xml, and element names must start with a letter (or an underscore or colon).
    Therefore, if you see anything but whitespace before encountering the first <, or if the character that follows it is not ?, ! or a name-start character, then the string cannot be XML. Since you know that the string is either XML or compressed XML, it must therefore be compressed. Someone with a little regex practice could probably squeeze that check into a 10-15 character pattern.
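
The two cheap checks described above (the 0x1F 0x9D magic value and the "does it start like XML" test) could be sketched in C++ roughly as follows. This is a minimal sketch rather than a definitive implementation: the function names has_lzw_magic, looks_like_plain_xml and is_probably_lzw_compressed are hypothetical, the input is assumed to be in an ASCII-compatible encoding such as UTF-8, and skipping a UTF-8 byte order mark and accepting !, _ and : after the first < are deliberately permissive assumptions of this sketch.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// True if the data begins with the 0x1F 0x9D magic value.
bool has_lzw_magic(const std::string& data) {
    return data.size() >= 2 &&
           static_cast<unsigned char>(data[0]) == 0x1F &&
           static_cast<unsigned char>(data[1]) == 0x9D;
}

// Heuristic: true if the data could plausibly be the start of a plain XML document.
bool looks_like_plain_xml(const std::string& data) {
    std::size_t i = 0;

    // Assumption: tolerate a UTF-8 byte order mark at the very beginning.
    if (data.compare(0, 3, "\xEF\xBB\xBF") == 0) {
        i = 3;
    }

    // Skip optional leading whitespace.
    while (i < data.size() &&
           std::isspace(static_cast<unsigned char>(data[i]))) {
        ++i;
    }

    // The first non-whitespace character must be '<', with something after it.
    if (i + 1 >= data.size() || data[i] != '<') {
        return false;
    }

    // '?' starts a declaration, '!' a comment or DOCTYPE; otherwise expect
    // an ASCII name-start character (non-ASCII names are not handled here).
    const unsigned char next = static_cast<unsigned char>(data[i + 1]);
    return next == '?' || next == '!' || next == '_' || next == ':' ||
           std::isalpha(next) != 0;
}

// Valid only under the premise that the input is either XML or compressed XML.
bool is_probably_lzw_compressed(const std::string& data) {
    return has_lzw_magic(data) || !looks_like_plain_xml(data);
}
```

Only the first few bytes are inspected, so reading a small prefix of each file into a std::string (opened in binary mode) is enough. Because the second test relies entirely on the "either XML or compressed XML" premise, it says nothing useful about arbitrary input; running the data through an actual LZW decompressor remains the more reliable check.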