I have a compressed file that is a CAB that I wish to extract a file from in Linux. Since there aren't any native CAB extractors on Linux, I figured I'd try my hand at getting one done.
While I've seen the MSZIP documentation[0] as well as [1] and [2], I'm having difficulties in decompressing it even given that each Block is compressed using a modified DEFLATE compressor. While I did use [3] to decompress the blocks, aside for the first block, the rest of the blocks had a lot of missing data [data which were completely nulled [i.e. 0x00 filled].
So I'm lead to believe I'll need to figure it out manually.
Unfortunately, I am having trouble understanding [2]. So, with the following data:
ED 9D 79 70 1C C7 75 87 07 E0 4D 88 24 EE 8B 94
04 1E A2 44 51 04 81 05 08 2D 11 4A C2 5E 00 96
...
And the following code to dissect the data:
#!/bin/env python
import os
def to_bin(in_item):
b = ord(chr(in_item))
retval = bin(b).replace("0b", "")
if len(retval) < 8:
lx = 8 - len(retval)
q = "0" * lx
retval = q + retval
return retval
def swapbytes(in_bit_str):
p = in_bit_str[:4]
q = in_bit_str[4:]
return q + p
pth = os.path.join("comp", "test_0.dat")
fp = open(pth, 'rb')
fpd = fp.read(10)
fp.close()
dx = ""
for item in fpd:
cx = to_bin(item)
scx = swapbytes(cx)
dx += scx
print(cx)
print("---> %s" % scx)
print("Final: %s" % dx)
bfinal = dx[0]
btype = dx[1:3]
print("Bfinal: %s" % bfinal)
print("Btype: %s" % btype)
rest = dx[3:]
hlit = rest[:5]
hdist = rest[5:11]
hclen = rest[11:15]
hlitd = int(hlit, 2)
hdistd = int(hdist, 2)
hclend = int(hclen, 2)
print("HLIT: %s [%d]" % (hlit, hlitd))
print("HDIST: %s [%d]" % (hdist, hdistd))
print("HCLEN: %s [%d]" % (hclen, hclend))
I get the following output:
$ python tstdcmp.py
11101101
---> 11011110
10011101
---> 11011001
01111001
---> 10010111
01110000
---> 00000111
00011100
---> 11000001
11000111
---> 01111100
01110101
---> 01010111
10000111
---> 01111000
00000111
---> 01110000
11100000
---> 00001110
Final: 1101111011011001100101110000011111000001011111000101011101111000011100000
0001110
Bfinal: 1
Btype: 10
HLIT: 11110 [30]
HDIST: 11011 [27]
HCLEN: 0011 [3]
So the basic concept that I got from the RFC1951 [2], was that in a byte, it's from highest bit to lowest bit. i.e. 7 6 5 4 3 2 1 0 But for multiple bytes, i.e. AB CD, the bytes are swapped to CD AB.
So with the code, I determined that the first two bytes ED 9D
should actually be 9D ED
.
And in bits:
11011110 11011001
That would mean BFinal == 1 and BTYPE == 10 (so dynamic huffman encoding).
With the first three bits removed, I get:
11110 11011001
So from 3.2.7 from [2], I should have the following info of the next few sets of bits:
5 Bits: HLIT, # of Literal/Length codes - 257 (257 - 286)
5 Bits: HDIST, # of Distance codes - 1 (1 - 32)
4 Bits: HCLEN, # of Code Length codes - 4 (4 - 19)
So, HLIT = 11110 [30] HDIST = 11011 [27] HCLEN = 0011 [3]
But from the above, the HLIT should be between 257 and 286 and HCLEN between 4 and 9; but I get HLIT = 30 and HCLEN = 3.
That said, the documentation in 3.2.7:
For even greater compactness, the code length sequences
themselves are compressed using a Huffman code.
So is this an example of further compressing with huffman codes? I had hoped to understand this RFC1951[2]; but it's terribly confusing.
I have seen some Youtube videos on DEFLATE compression; but they show mostly on the algorithm and not how the bytes are packed in a file.
Any help greatly appreciated.
Thanks
[1] - How to decompress MSZIP files with c#?
[2] - https://datatracker.ietf.org/doc/html/rfc1951
[3] - https://www.nayuki.io/page/simple-deflate-decompressor
HLIT
, etc. values are stored exactly as stated in the RFC. The first five bits are the number of literal/length codes minus 257. So you take the value of the five bits, which gives a number in 0..31, and add 257 to that. That gives a number in the range 257..288. The allowed range is actually 257..286 as noted on that same line, so the last two possible values of the five bits, 30 and 31, should not appear in a valid deflate stream.inflate
functions.inflateResetKeep()
function of zlib to restart the inflate process while retaining the dictionary from the previous inflate operation.For reference, here is a decoding of the initial bytes of the deflate stream you provided, using infgen:
! infgen 2.5 output
!
last ! 1
dynamic ! 10
count 286 30 16 ! 1100 11101 11101
code 16 4 ! 100
code 17 7 ! 111
code 0 4 ! 100 000
code 8 3 ! 011
code 7 4 ! 100
code 9 3 ! 011
code 6 4 ! 100
code 10 3 ! 011
code 5 4 ! 100
code 11 3 ! 011
code 4 5 ! 101
code 12 3 ! 011
code 3 7 ! 111
code 2 6 ! 110 000
lens 6 ! 1100
lens 8 ! 000
lens 8 ! 000
lens 9 ! 001
repeat 6 ! 11 1110
lens 9 ! 001
lens 8 ! 000
lens 10 ! 010
lens 9 ! 001
lens 9 ! 001
lens 9 ! 001
lens 8 ! 000
repeat 6 ! 11 1110
repeat 4 ! 01 1110
lens 9 ! 001
lens 9 ! 001
lens 10 ! 010
lens 10 ! 010
lens 10 ! 010
lens 8 ! 000
lens 9 ! 001
repeat 3 ! 00 1110
lens 10 ! 010
lens 9 ! 001
lens 10 ! 010
lens 10 ! 010
lens 9 ! 001
lens 10 ! 010
lens 9 ! 001
lens 10 ! 010
lens 9 ! 001
lens 8 ! 000
lens 9 ! 001
lens 8 ! 000
lens 8 ! 000
lens 7 ! 1101
lens 8 ! 000
lens 8 ! 000
lens 9 ! 001
lens 8 ! 000
lens 10 ! 010
lens 7 ! 1101
lens 9 ! 001
lens 8 ! 000
lens 12 ! 100
lens 9 ! 001
lens 10 ! 010
lens 10 ! 010
lens 10 ! 010
lens 8 ! 000
lens 7 ! 1101
repeat 4 ! 01 1110
lens 8 ! 000
lens 8 ! 000
lens 8 ! 000
lens 7 ! 1101
lens 9 ! 001