How to determine which compression method is being used on a block of data?


I have a program that reads files from an older system. One section of data in each file (not the whole file) is compressed using an unknown scheme, and I need to work out how to decompress it. I have examples of the data before and after compression, but I'm not sure how to identify the scheme from them, and was hoping someone here might know how to figure this out. Based on some Googling, the most common compression scheme seems to be Huffman coding, but I implemented and tested that, and it doesn't appear to be the one used here.

Here is an example of a section of data before compression (in hex bytes):

00 00 00 00 FF FF FF FF EC 02 00 00 23 03 00 00 23 03 00 00 03 03 FF FF FF FF FF FF FF FF FF FF FF FF 55 03 00 00 00 00 01 01 09 04 04 04 09 09 09 09 07 09 00 06 00 7C 00 1F 55 01 00 C9 1F 54 01 00 07 00 F7 00 2F 54 53 01 00 05 11 54 3D 00 2F 53 52 01 00 03 11 53 57 00 4F 53 53 52 51 01 00 01 20 52 53 1C 00 6E 53 53 52 52 51 50 01 00 02 1F 00 00 58 00 1F 51 1F 00 08 00 58 00 0E 3C 00 05 3E 00 0F 3A 00 01 00 02 00 03 5D 00 0F 1F 00 FF 63 00 B3 01 0F 02 00 01 05 F0 01 0F 02 00 04 03 2E 02 0F 02 00 06 01 6C 02 0F 02 00 03 50 54 54 54 54 54 01 00 00 00 06 00 42 00 1F 55 01 00 C5 1F 54 01 00 0C 1F 53 01 00 09 4F 54 54 53 52 01 00 07 01 1F 00 1F 51 01 00 05 03 1F 00 1F 50 01 00 03 0F 1F 00 FF C7 0F 0F 02 0B 0F 4D 02 0B 0F 8B 02 0B 0F 02 00 06 50 54 54 54 54 54 54 00 00 00 06 00 83 00 1F 55 01 00 C5 1C 54 01 00 0A E9 00 2A 54 53 01 00 03 2E 00 04 25 00 38 54 53 52 01 00 03 2C 00 03 25 00 66 54 54 54 53 52 51 01 00 03 2A 00 03 25 00 00 27 00 44 53 52 51 50 01 00 03 28 00 03 25 00 2B 52 52 1F 00 02 02 00 03 25 00 2F 51 51 1F 00 03 04 02 00 0F 1F 00 FF 89 00 BD 01 0F 02 00 01 02 D1 01 00 F9 01 0F 02 00 03 00 0F 02 00 35 02 0F 02 00 05 01 6F 02 0F 02 00 03 50 54 54 54 54 54 A7 00 00 00 0B 00 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 12 13 14 15 17 17 17 17 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 11 13 14 16 17 17 17 17 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 13 17 17 17 17 17 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 12 16 17 17 16 16 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 12 16 16 16 15 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 13 15 15 14 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 13 14 14 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 11 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10

and here is the same section after compression:

E2 00 00 00 00 FF FF FF FF EC 02 00 00 23 03 04 00 20 03 03 12 00 04 02 00 20 55 03 24 00 F1 85 01 01 09 04 04 04 09 09 09 09 07 09 00 06 00 7C 00 1F 55 01 00 C9 1F 54 01 00 07 00 F7 00 2F 54 53 01 00 05 11 54 3D 00 2F 53 52 01 00 03 11 53 57 00 4F 53 53 52 51 01 00 01 20 52 53 1C 00 6E 53 53 52 52 51 50 01 00 02 1F 00 00 58 00 1F 51 1F 00 08 00 58 00 0E 3C 00 05 3E 00 0F 3A 00 01 00 02 00 03 5D 00 0F 1F 00 FF 63 00 B3 01 0F 02 00 01 05 F0 01 0F 02 00 04 03 2E 02 0F 02 00 06 01 6C 02 0F 02 00 03 50 54 54 54 54 54 01 00 00 00 06 00 42 84 00 10 C5 84 00 90 0C 1F 53 01 00 09 4F 54 54 7E 00 F0 02 07 01 1F 00 1F 51 01 00 05 03 1F 00 1F 50 01 00 03 56 00 D0 C7 0F 0F 02 0B 0F 4D 02 0B 0F 8B 02 0B 51 00 02 4A 00 11 54 4A 00 12 83 4A 00 80 1C 54 01 00 0A E9 00 2A CD 00 71 03 2E 00 04 25 00 38 51 00 80 03 2C 00 03 25 00 66 54 5F 00 50 51 01 00 03 2A 0F 00 51 00 27 00 44 53 D3 00 20 03 28 10 00 70 2B 52 52 1F 00 02 02 0B 00 90 2F 51 51 1F 00 03 04 02 00 79 00 30 89 00 BD C8 00 60 01 02 D1 01 00 F9 0A 00 60 03 00 0F 02 00 35 CE 00 30 05 01 6F 07 00 03 D5 00 7A A7 00 00 00 0B 00 10 01 00 8A 12 13 14 15 17 17 17 17 16 00 5F 10 11 13 14 16 17 00 00 30 10 10 13 16 00 0C 2E 00 80 10 10 12 16 17 17 16 16 0A 00 0A 02 00 5A 12 16 16 16 15 13 00 01 02 00 41 13 15 15 14 09 00 0B 02 00 2F 13 14 17 00 02 0F 02 00 06 1A 11 1A 00 50 10 10 10 10 10

Here's what I know about the data:

- The length of the data after compression is variable.

- There is no common prefix for the data (the first byte varies).

I'd really appreciate it if someone could help me determine the compression scheme being used, or even point me in the right direction.


Solution

  • It is LZ4 output with some bytes stripped off. If I run lz4 -l (legacy format) on your input and then strip off the first eight bytes, I get exactly your output.
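
    If the stripped eight bytes are the legacy frame's 4-byte magic number and 4-byte compressed block size (which is how the LZ4 legacy frame is laid out, though I have not confirmed it against your files), then what remains is a raw LZ4 block, and it can be decoded without the lz4 tool. Below is a minimal sketch of a block decoder in Python. The function name lz4_block_decompress is my own, not part of any library, and the sketch assumes the whole section is a single block with no size prefix.

```python
def lz4_block_decompress(src: bytes) -> bytes:
    """Decode one raw LZ4 block (no frame header, no size prefix).

    Hypothetical helper for illustration; not part of any LZ4 library.
    """
    out = bytearray()
    i = 0
    n = len(src)
    while i < n:
        # Each sequence starts with a token: the high nibble is the literal
        # length, the low nibble is the match length minus 4.
        token = src[i]
        i += 1

        lit_len = token >> 4
        if lit_len == 15:
            # A nibble of 15 means extra length bytes follow; keep adding
            # until a byte other than 255 is read.
            while True:
                extra = src[i]
                i += 1
                lit_len += extra
                if extra != 255:
                    break

        if i + lit_len > n:
            raise ValueError("literal run extends past end of input")
        out += src[i:i + lit_len]
        i += lit_len

        # The last sequence of a block holds literals only.
        if i >= n:
            break

        # Two-byte little-endian offset, counted back from the end of the output.
        offset = src[i] | (src[i + 1] << 8)
        i += 2
        if offset == 0 or offset > len(out):
            raise ValueError("invalid match offset")

        match_len = token & 0x0F
        if match_len == 15:
            while True:
                extra = src[i]
                i += 1
                match_len += extra
                if extra != 255:
                    break
        match_len += 4  # the minimum match length is 4

        # Copy byte by byte: a match may overlap the bytes it is producing.
        start = len(out) - offset
        for k in range(match_len):
            out.append(out[start + k])

    return bytes(out)


# First four sequences of the compressed example from the question.
prefix = bytes.fromhex(
    "E2 00 00 00 00 FF FF FF FF EC 02 00 00 23 03 04 00 "
    "20 03 03 12 00 04 02 00 20 55 03 24 00"
)
print(lz4_block_decompress(prefix).hex(" "))
```

    Run on just those first four sequences, this reproduces the first 40 bytes of your uncompressed example. The leading byte E2 is a token (14 literals, then a 6-byte match), which is also why the first byte varies from section to section. In practice you would pass the whole compressed section to the function, or use an existing LZ4 library's block API instead of hand-rolled code.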