I have two XML files: one is compressed with LZW, the other is plain text. How can I tell whether a given file is compressed or not?
The obvious thing to do would, of course, be to feed the string to an LZW decompressor and see whether it reports an error and/or whether the length of the string increases by approximately 200%.
That aside, a (well-formed) LZW file, as written by the Unix compress utility, starts with the magic value `0x1F 0x9D`. Of course it is possible to LZW-compress a string and not include the magic value, but it is a start (and very easy to check).
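The magic-value check can be sketched in a few lines of Python. This is only an illustration: the names `COMPRESS_MAGIC` and `has_compress_magic` are made up for this example, and the check assumes the data was written in the Unix compress (`.Z`) container format, which is where the `0x1F 0x9D` header comes from. A raw LZW stream without that header will not be detected.

```python
# Magic bytes at the start of a Unix compress (.Z) file.
COMPRESS_MAGIC = b"\x1f\x9d"


def has_compress_magic(data: bytes) -> bool:
    """Return True if the data starts with the compress/LZW magic value."""
    return data[:2] == COMPRESS_MAGIC
```

To use it on a file, read the first two bytes in binary mode (`open(path, "rb").read(2)`) and pass them in. A positive result is strong evidence of compression; a negative result only means the header is absent.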
A (well-formed) XML document should start with an XML declaration and must contain a root element; apart from the declaration, only whitespace, comments, processing instructions, and a DOCTYPE may come before that element. The XML declaration starts with the string `<?xml`, and element names must start with a letter (or `_` or `:`).
Therefore, if you see anything but whitespace before the first `<`, or if the character that follows it is not `?`, `!`, or a character that can start an element name (a letter, `_`, or `:`), then the string cannot be XML. Since you know that the string is either XML or compressed XML, it must therefore be compressed. Someone with a little regex practice can probably squeeze that into a 10-15 character pattern.
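As a rough sketch of such a pattern in Python (the function names `looks_like_xml` and `is_compressed` are hypothetical): the regex accepts optional whitespace, then `<`, then `?` or a name-start character. It also accepts `!`, so that documents beginning with a comment or DOCTYPE are not misclassified, and `_` and `:`, which XML permits at the start of a name. It assumes an ASCII-compatible encoding such as UTF-8 and only ASCII letters at the start of the root element name; UTF-16 input would need to be decoded first. It inspects only the first few characters, so it is a heuristic and not a well-formedness check.

```python
import re

# Optional ASCII whitespace, "<", then "?", "!" or an ASCII name-start character.
_XML_START = re.compile(rb"\s*<[?!A-Za-z_:]")
_UTF8_BOM = b"\xef\xbb\xbf"


def looks_like_xml(data: bytes) -> bool:
    """Heuristic: does the data begin the way an XML document would?"""
    if data.startswith(_UTF8_BOM):
        # Skip a UTF-8 byte order mark, which some editors prepend.
        data = data[len(_UTF8_BOM):]
    return _XML_START.match(data) is not None


def is_compressed(data: bytes) -> bool:
    """Valid only under the premise that data is either XML or compressed XML."""
    return not looks_like_xml(data)
```

Reading the first few dozen bytes of each file in binary mode is enough input for this check.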