Search code examples
ms-wordzipcompressiondeflatedata-recovery

Heuristics to improve brute forcing the location of a DEFLATE block


I work in the data recovery industry. Sometimes a document (.docx) has the beginning overwritten but most of the file might still be intact. A 'docx' is really just a zip file, and the primary document is stored in a file 'word/document.xml'.

A document can be compressed into multiple deflate blocks, so even if the beginning is overwritten, later blocks might be salvageable if we could only locate them. The plan would be to brute force, trying to decode the stream at every bit location. For example, a 200k file would mean attempting a decode a maximum of 1.6 megabits times (give or take).

One helpful thing is that a block should never start with '011' or '111' because bit 2 & 3 can only be 00, 01, or 10 (I believe). So that weeds out a lot of candidates.

Are there any other clues that can be used to eliminate candidates. We can assume a 'standard' stream with no weird encoding, as this is coming from Microsoft Word.


Solution

  • zlib's inflate will automatically rule out all of the patterns that are ruled out, as soon as it encounters them. So, for example, there are several constraints in a dynamic block header, which if violated will be detected immediately. Similarly, a mismatch in the stored block lengths will be detected immediately. There are invalid fixed Huffman codes, which will appear in random data within an average of 47 bytes, and will be detected as soon as they are encountered. Or also on average in 47 bytes, the fixed block will end, returning the state to detecting an error in any kind of subsequent "block" in random data.

    Also your invalid block type (3) will be detected by zlib super immediately.

    You don't need to implement any special rules yourself. zlib will detect them.

    As noted in a comment, even if you find the first bit of a block, you will then be faced with references to uncompressed data that preced that block. You will just need to keep going until you, hopefully, get to where you have none of those. Though it is entirely possible for the same string to keep being referenced, from the start of the file to the end. Unless you can correctly guess what that string was from context, the compressed data is unrecoverable.

    Note that because of these references, you will need to short out the "invalid distance too far back" error, since that will pop up a lot, even if you have hit the correct bit location to start a block. There is a compile-time option INFLATE_ALLOW_INVALID_DISTANCE_TOOFAR_ARRR that you can use for that.