file, gzip, archive

Find gzip start and end?


I have a file that contains random bytes with multiple gzip streams embedded in it. How can I find the start and end of each gzip stream inside the file? There are many random bytes between the streams, so basically I need to locate every gzip stream and extract it.


Solution

  • From RFC 1952 (the GZIP file format specification):

    Each GZIP file is just a bunch of data chunks (called members), one for each file contained.

    Each member starts with the following bytes:

    • 0x1F (ID1)
    • 0x8B (ID2)
    • compression method. 0x08 for a DEFLATEd file. 0-7 are reserved values.
    • flags. The top three bits are reserved and must be zero.
    • (4 bytes, little-endian) last modification time as a Unix timestamp. May be set to 0.
    • extra flags, defined by the compression method.
    • operating system, actually the file system. 0=FAT, 3=UNIX, 11=NTFS
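
The fixed header fields listed above can be checked with a few lines of Python. This is only a sketch: `parse_fixed_header` is a hypothetical helper name, and it assumes the candidate bytes are already in memory. It rejects anything that is not a DEFLATE member with the reserved flag bits cleared, which filters out most accidental `0x1F 0x8B` pairs in random data but not all of them.

```python
import struct


def parse_fixed_header(data, offset=0):
    """Return the fixed 10-byte gzip header fields at `offset`,
    or None if the bytes there cannot be a DEFLATE gzip member."""
    if len(data) - offset < 10:
        return None
    # ID1, ID2, CM, FLG, MTIME (4 bytes, little-endian), XFL, OS
    id1, id2, cm, flg, mtime, xfl, os_code = struct.unpack_from(
        "<BBBBIBB", data, offset
    )
    if (id1, id2) != (0x1F, 0x8B):
        return None
    if cm != 8:  # only DEFLATE is defined; 0-7 are reserved
        return None
    if flg & 0xE0:  # the top three flag bits must be zero
        return None
    return {"cm": cm, "flg": flg, "mtime": mtime, "xfl": xfl, "os": os_code}
```

A header that passes this check is still only a candidate; the member has to be decompressed to confirm it.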

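    The end of a member is not marked by any delimiter, and the header carries no length field, so the only way to find the end is to decompress the member from start to finish. Each member ends with an 8-byte trailer (CRC-32 and uncompressed size modulo 2^32) that immediately follows the compressed data. Note that concatenating multiple valid GZIP files creates a valid GZIP file. Also note that handing a decompressor more bytes than the member contains may still decode the member successfully, unless the library treats trailing data as a fatal error.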
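
One way to put this together in Python is to search for the magic bytes and let `zlib` walk each candidate member. The sketch below assumes the whole file fits in memory, and `extract_gzip_members` is a hypothetical name. Passing `31` as `wbits` makes `zlib` expect a gzip header and trailer; the decompressor stops at the end of the first member and leaves everything after it in `unused_data`, which gives the end offset.

```python
import zlib


def extract_gzip_members(data):
    """Yield (start, end, payload) for each gzip member found in `data`.

    `data[start:end]` is the raw member, `payload` its decompressed content.
    """
    pos = 0
    while True:
        # ID1, ID2 and CM=8 (DEFLATE) mark a candidate member
        start = data.find(b"\x1f\x8b\x08", pos)
        if start == -1:
            return
        # 31 = 16 + MAX_WBITS: parse a gzip header and trailer
        decomp = zlib.decompressobj(31)
        try:
            payload = decomp.decompress(data[start:])
        except zlib.error:
            # false positive (bad header, bad data or CRC mismatch)
            pos = start + 1
            continue
        if not decomp.eof:
            # stream ran off the end of the input without finishing
            pos = start + 1
            continue
        # bytes after the trailer were not consumed
        end = len(data) - len(decomp.unused_data)
        yield start, end, payload
        pos = end
```

Because `zlib` verifies the CRC-32 and size in the trailer, a random byte sequence is very unlikely to be accepted as a member. For large files, the slice `data[start:]` copies the remaining bytes on every candidate; feeding the decompressor in chunks would avoid that at the cost of more bookkeeping.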