
Huge Compression Difference in RAR archive and Gzip: Is there anything that I am missing?


I have a huge text file, about 500MB in size. I tried to archive it with Gzip, both from a Python program and from the command line. In both cases the archived file is about 240MB, whereas WinRAR on Windows compresses it to around 450KB. Is there something I am missing here? Why is there such a big difference, and what can I do to achieve a similar level of compression?

I have tagged this with Python as well, since any Python code for this would be very helpful.

Here are the first 3 lines of the file:

$ head 100.txt -n 3
31731610:22783120;
22783120:
45476057:39683372;5879272;54702019;58780534;30705698;60087296;98422023;55173626;5607459;843581;11846946;97676518;46819398;60044103;48496022;35228829;6594795;43867901;66416757;81235384;42557439;40435884;60586505;65993069;76377254;82877796;94397118;39141041;2725176;56097923;4290013;26546278;18501064;27470542;60289066;43986553;67745714;16358528;63833235;92738288;77291467;54053846;93392935;10376621;15432256;96550938;25648200;10411060;3053129;54530514;97316324;

Solution

  • It is possible that the file is highly redundant, with a repeating pattern larger than 32K. gzip's deflate only looks 32K back for matches, whereas the others can capitalize on history much further back.

    Update:

    I just made a file consisting of a 64K block of random data repeated 4096 times (256 MB). gzip, with its 32K window, was blind to the redundancy and could not compress it; in fact it expanded the data to 256.04 MB. xz (LZMA, with an 8 MB window) compressed it to 102 KB. You can get the same long-window compression from Python; see the sketch below.
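Since the question asks for Python code, here is a minimal sketch of streaming the file through LZMA (the same algorithm xz uses) with the standard-library lzma module. The input name 100.txt is taken from the question; the output name and preset level are arbitrary choices.

import lzma
import shutil

# Filenames based on the question's example; adjust as needed.
SRC = "100.txt"
DST = "100.txt.xz"

# lzma.open writes an .xz container. LZMA's dictionary (8 MB at preset 6,
# versus deflate's 32K window) lets it find the long-range repetition
# that gzip cannot see.
with open(SRC, "rb") as src, lzma.open(DST, "wb", preset=6) as dst:
    shutil.copyfileobj(src, dst)

On data with the kind of long-range redundancy shown in the experiment above, this should land far closer to the WinRAR result than gzip does; raising the preset (up to 9) enlarges the dictionary further at the cost of memory and time.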