How to determine unknown compression of data


How do you folks figure out how some data is compressed?

I'm trying to take apart a binary file. I see the structure in it, and have found where some data segments are.

The UNIX 'file' command just says they are data. The "Signsrch signature file" by Luigi Auriemma didn't match any of the blocks.

The file extension is ".dz". The file starts with "Dr*Z" and the data headers start with "zFED". Google searches didn't turn up any information on those. The data blocks have no other structure that I can see: no patterns, no readable strings, etc.

(There is a DZ file format, but it is proprietary, from 2000-2005, for compressing Quake game files. I haven't yet been able to run "dzip.exe" on this Macbook.)

Here is the format of the data headers:

    char[4] "zFED"        or "DEFz" flipped
    int32   full_size     size of uncompressed data, little-endian
    int32   cmpr_size     number of bytes of data here, L.E.
    byte[]  data           ...
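
The header layout above can be read with Python's `struct` module. This is a minimal sketch, assuming exactly the three fields listed and treating both sizes as unsigned; the function name `parse_block_header` is made up for illustration, and the sample bytes are copied from the block 1 hex dump further down.

```python
import struct

# tag (4 bytes), full_size, cmpr_size: both sizes little-endian 32-bit.
# Assumption: the sizes are unsigned; the layout above only says int32.
HEADER = struct.Struct("<4sII")

def parse_block_header(buf, offset):
    """Return (tag, full_size, cmpr_size, data_offset) for a header at offset."""
    tag, full_size, cmpr_size = HEADER.unpack_from(buf, offset)
    if tag != b"zFED":
        raise ValueError(f"no zFED tag at offset {offset:#x}")
    return tag, full_size, cmpr_size, offset + HEADER.size

# First 14 bytes of block 1 as shown in the hex dump (file offset 0xd2).
sample = bytes.fromhex("7a464544 80dc0100 140c0100 ec7d")
print(parse_block_header(sample, 0))
```

For the sample bytes this decodes `full_size` as 121984 and `cmpr_size` as 68628, with the data starting 12 bytes after the tag.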

There might be more fields in the header than this. Each of the four data blocks starts with these bytes (hex):

    EC 7D 79 5C 54 47 ...
    EC BD 7B 7C 94 C5 ...
    C4 BC 07 78 23 57 ...
    EC BD 7B 5F 13 D9 ...

So there could be some flags or format fields still there.
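
One way to test the "flags" idea is to read the leading bytes the way a common decoder would. The sketch below checks two well-known layouts: a zlib wrapper (RFC 1950: low nibble of the first byte is 8 and the first two bytes, read big-endian, are a multiple of 31) and a bare deflate block header (RFC 1951: bit 0 is BFINAL, bits 1-2 are BTYPE). The helper names are invented here, and a plausible BTYPE is only weak evidence, since three of the four possible values are valid.

```python
BTYPE_NAMES = {
    0: "stored",
    1: "fixed Huffman",
    2: "dynamic Huffman",
    3: "reserved (invalid)",
}

def looks_like_zlib(b0, b1):
    """True if the two bytes pass the RFC 1950 header checks."""
    return (b0 & 0x0F) == 8 and ((b0 << 8) | b1) % 31 == 0

def deflate_block_type(b0):
    """Interpret b0 as the first byte of a raw deflate stream."""
    bfinal = b0 & 1          # 1 if this is the last block
    btype = (b0 >> 1) & 3    # block encoding
    return bfinal, BTYPE_NAMES[btype]

# First two bytes of each of the four data blocks.
for b0, b1 in [(0xEC, 0x7D), (0xEC, 0xBD), (0xC4, 0xBC), (0xEC, 0xBD)]:
    print(f"{b0:02X} {b1:02X}: zlib header? {looks_like_zlib(b0, b1)}, "
          f"as raw deflate: {deflate_block_type(b0)}")
```

None of the four blocks passes the zlib check, while all four would read as a non-final dynamic Huffman block if the data were raw deflate.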

Here is the start of each data block:

BLOCK 1
000000d0        7a 46  45 44 80 dc  01 00 14 0c  01 00 ec 7d  ..zFED.\......l}
                tag--------- full_size--- cmpr_size--- [data ...]
000000e0  79 5c 54 47  b6 70 dd db  b7 ef 6d 9a  a5 1b 50 41  y\TG6p][7om.%.PA
000000f0  41 68 f7 85  08 a8 d1 68  dc 5a c3 24  0d 0a c4 24  Ahw..(Qh\ZC$..D$
00000100  6f 92 7c f3  32 71 66 92  4c 76 27 33  ef 7d 73 65  o.|s2qf.Lv'3o}se


BLOCK 2
00010ce0                     7a 46  45 44 00 1e  03 00 be 5f  u_.)..zFED....>_
                             tag--------- full_size--- cmpr_size-
00010cf0  01 00 ec bd  7b 7c 94 c5  d5 38 fe 3c  7b 7b 72 db  ..l={|.EU8~<{{r[
         --size [data ...
00010d00  6c 76 37 77  2e 49 08 57  23 09 57 41  f0 12 08 e0  lv7w.I.W#.WAp..`
00010d10  26 84 8b 97  da 56 5a b5  b5 6a d5 b6  78 ab ae 37  &...ZVZ55jU6x+.7
00010d20  b2 88 12 ad  d6 2e 77 d4  56 df d6 b6  62 6d 5f 37  2..-V.wTV_V6bm_7


BLOCK 3
00026ca0               7a 46 45 44  46 81 01 00  e4 ec 00 00  h,..zFEDF...dl..
                       tag--------  full_size--  cmpr_size--
00026cb0  c4 bc 07 78  23 57 76 26  0a 80 24 48  80 48 44 22  D<.x#Wv&..$H.HD"
          [data ...
00026cc0  08 80 48 24  41 30 81 39  b3 99 73 0e  60 68 66 b2  ..H$A0.93.s.`hf2
00026cd0  99 9b a9 d9  cc cd 50 e4  a8 31 54 ad  68 ef d8 e3  ..)YLMPd(1T-hoXc


BLOCK 4
00035980                            7a 46 45 44  60 f8 13 00  ..<\s6k?zFED`x..
                                    tag--------  full_size--
00035990  7a 9c 00 00  ec bd 7b 5f  13 d9 b6 b6  5d 15 82 78  z...l={_.Y66]..x
          cmpr_size--  [data ...
000359a0  06 14 05 3c  c6 43 e3 a1  5b 14 c4 33  4a c9 49 51  ...<FCc![.D3JIIQ
000359b0  54 14 5b d1  76 b5 1d 21  2d 59 62 e2  0a a1 5b fb  T.[Qv5.!-Ybb.![{

The histograms of the data are fairly flat. One fluctuates much more, however, and it's the one I'm most interested in.

[Figures: byte-value histograms of block 1, block 2, block 3, and block 4]

Examining the histograms

    Block     Usual span (%)   Notes
    Block 1   0.30 - 0.45      peaks to 0.5%
    Block 2   0.35 - 0.45      peaks to 0.3% and 0.5%
    Block 3   0.34 - 0.44      peaks to 0.32% and 0.49%
    Block 4   0.05 - 2         peaks just before 64 128 160 208
                               dips at 32 48 64 96 128 160
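
For anyone who wants to reproduce numbers like these, here is a rough sketch of how a per-byte histogram (as a percentage of the block) and the Shannon entropy can be computed. The function names are my own, not from any tool mentioned above. A perfectly flat histogram sits at 100/256, about 0.39% per byte value, which is in the range the first three blocks show.

```python
import math
from collections import Counter

def byte_histogram(data):
    """Return a 256-entry list: percentage of the block taken by each byte value."""
    if not data:
        return [0.0] * 256
    counts = Counter(data)
    return [100.0 * counts.get(value, 0) / len(data) for value in range(256)]

def entropy_bits_per_byte(data):
    """Shannon entropy of the byte distribution, between 0.0 and 8.0."""
    if not data:
        return 0.0
    n = len(data)
    return -sum(c / n * math.log2(c / n) for c in Counter(data).values())

# Synthetic check: every byte value once gives a flat histogram.
flat = bytes(range(256))
print(min(byte_histogram(flat)), max(byte_histogram(flat)))
print(entropy_bits_per_byte(flat))
```

Entropy close to 8 bits per byte usually means compressed or encrypted data; a noticeably lower value, as block 4's uneven histogram hints at, can point to a different encoding or to less effective compression.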

The file I'm looking at is "Kurzweil-SP-Updater.dz" inside this file: https://kurzweil.com/wp-content/uploads/2022/08/SP7G_UpdateE1.06L1.1.2.zip

The question is: What should I try next? Thank you!


Solution

  • The majority of the file (99.9%) consists of four complete deflate streams:

    offset 222, length 68616, decompressed 121984
    offset 68850, length 90034, decompressed 204288
    offset 158896, length 60632, decompressed 98630
    offset 219540, length 40045, decompressed 1308768
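
A sketch of how streams like these can be unpacked with Python's `zlib` module, assuming they are raw deflate with no zlib or gzip wrapper (which is why `wbits` is -15). The function name `inflate_raw` is invented for illustration, and the demo runs on a synthetic buffer rather than the real file.

```python
import zlib

def inflate_raw(buf, offset):
    """Inflate one raw deflate stream starting at offset.

    Returns (decompressed_bytes, compressed_length, reached_end_of_stream).
    """
    d = zlib.decompressobj(-15)          # negative wbits: raw deflate, no header
    out = d.decompress(buf[offset:])
    out += d.flush()
    # Bytes after the end of the stream are left in unused_data.
    consumed = len(buf) - offset - len(d.unused_data)
    return out, consumed, d.eof

# Synthetic stand-in for the file: 4 junk bytes, a raw stream, then a trailer.
payload = b"example payload " * 100
c = zlib.compressobj(9, zlib.DEFLATED, -15)
raw = c.compress(payload) + c.flush()
buf = b"hdr!" + raw + b"trailer"

out, consumed, complete = inflate_raw(buf, 4)
print(len(out), consumed, complete)
```

Against the real file, `inflate_raw(data, 222)` should return 121984 bytes and report a length of 68616 if the first line of the list above is right. The reported length is also a quick way to see where each stream actually ends relative to the `cmpr_size` field.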