c# compression gzip deflate sharpziplib

Why is my GZip-compressed string larger than the original string when using SharpZipLib in C#?


My string is the content of a JSON file (test.json), shown below:

{
  "objectId": "bbad4cc8-bce8-438e-8683-3e603d746dee",
  "timestamp": "2021-04-28T14:02:42.247Z",
  "variable": "temperatureArray",
  "model": "abc.abcdefg.abcdef",
  "quality": 5,
  "value": [ 43.471600438222104, 10.00940101687303, 39.925500606152, 32.34369812176735, 33.07786476010357 ]
}

I am compressing it as follows:

using ICSharpCode.SharpZipLib.GZip;
using System;
using System.Diagnostics;
using System.IO;
using System.Reflection;
using System.Text;

namespace GZipTest
{
    public static class SharpZipLibCompression
    {
        public static void Test()
        {
            Trace.WriteLine("****************SharpZipLib Test*****************************");
            var testFile = Path.Combine(Path.GetDirectoryName(Assembly.GetExecutingAssembly().Location), "test.json");
            var text = File.ReadAllText(testFile);
            var ipStringSize = System.Text.UTF8Encoding.Unicode.GetByteCount(text);
            var compressedString = CompressString(text);
            var opStringSize = System.Text.UTF8Encoding.Unicode.GetByteCount(compressedString);
            float stringCompressionRatio = (float)opStringSize / ipStringSize;
            Trace.WriteLine("String Compression Ratio using SharpZipLib" + stringCompressionRatio);
        }

        public static string CompressString(string text)
        {
            if (string.IsNullOrEmpty(text))
                return null;
            byte[] buffer = Encoding.UTF8.GetBytes(text);
            using (var compressedStream = new MemoryStream())
            {
                GZip.Compress(new MemoryStream(buffer), compressedStream, false);
                byte[] compressedData = compressedStream.ToArray();
                return Convert.ToBase64String(compressedData);
            }
        }
    }
}

But my compressed string size (opStringSize) is more than the original string size (ipStringSize). Why?


Solution

  • Your benchmark has some fairly fundamental problems:

    1. You're using UTF-16 to encode the input string to bytes when calculating its length (UTF8Encoding.Unicode is just an unclear way of writing Encoding.Unicode, which is UTF-16). That encodes to 2 bytes per character, and for ASCII text like this JSON every second byte will be 0.
    2. You're base64-encoding your output. While this is a way to print arbitrary binary data as text, it uses 4 characters to represent 3 bytes of data, so you're increasing the size of your output by 33%.
    3. You're then using UTF-16 to turn the base64-encoded string into bytes again, which takes 2 bytes per character again. So that's an artificial 2x added to your result...

    It so happens that the two uses of UTF-16 more or less cancel out, but the base64 encoding is still responsible for much of the discrepancy you're seeing.

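    Take the base64 step out, and you get a compression ratio of 0.80338985. Put base64's 4/3 expansion back in and that becomes roughly 1.07, which is why the output looked larger than the input.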
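
    To make the size accounting in the three points above concrete, here is a minimal sketch using only the .NET base class library. The class name, method names, and the 300-byte placeholder array are made up for illustration; the array merely stands in for some compressed output and is not real GZip data.

```csharp
using System;
using System.Text;

public static class SizeAccounting
{
    // Size as the question measured it: base64 text, then counted as UTF-16 bytes.
    public static int MeasuredAsUtf16Base64(byte[] data)
    {
        string base64 = Convert.ToBase64String(data);   // 4 chars per 3 bytes
        return Encoding.Unicode.GetByteCount(base64);   // 2 bytes per char
    }

    // Size of a string counted as UTF-16 bytes (what Encoding.Unicode means).
    public static int Utf16Size(string text) => Encoding.Unicode.GetByteCount(text);

    // Size of a string counted as UTF-8 bytes (what was actually fed to the compressor).
    public static int Utf8Size(string text) => Encoding.UTF8.GetByteCount(text);

    public static void Main()
    {
        byte[] placeholder = new byte[300];              // hypothetical stand-in for compressed bytes
        string ascii = "{\"quality\": 5}";               // 14 ASCII characters

        Console.WriteLine(placeholder.Length);                  // 300
        Console.WriteLine(MeasuredAsUtf16Base64(placeholder));  // 800
        Console.WriteLine(Utf8Size(ascii));                     // 14
        Console.WriteLine(Utf16Size(ascii));                    // 28
    }
}
```

    The 300 raw bytes are reported as 800 once base64 and UTF-16 are layered on top, while the ASCII input only doubles. Comparing raw byte counts on both sides avoids both distortions.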

    That's not bad, given that compression introduces overheads: a GZip stream always carries a fixed header and trailer (at least 18 bytes in total), and they're there regardless of how well your data compresses. You can only really expect compression to make a significant difference on larger inputs.
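
    One way to measure the ratio without the encoding distortions is to compare raw byte counts directly. The sketch below is an assumption-laden illustration, not the original poster's code: it swaps in the built-in System.IO.Compression.GZipStream in place of SharpZipLib so that it runs without extra packages, and the class and method names are hypothetical. The same measurement should apply if the compression call is replaced with the SharpZipLib one from the question.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Text;

public static class GzipRatioSketch
{
    // Compresses raw bytes to raw bytes: no base64, no string round trip.
    public static byte[] Compress(byte[] input)
    {
        using (var output = new MemoryStream())
        {
            // The GZipStream must be disposed before reading the result,
            // otherwise the final block and trailer may not be written yet.
            using (var gzip = new GZipStream(output, CompressionLevel.Optimal, leaveOpen: true))
            {
                gzip.Write(input, 0, input.Length);
            }
            return output.ToArray();
        }
    }

    // Compressed bytes divided by UTF-8 input bytes; below 1.0 means the data shrank.
    public static double Ratio(string text)
    {
        byte[] raw = Encoding.UTF8.GetBytes(text);
        return (double)Compress(raw).Length / raw.Length;
    }

    // Builds a longer input by repeating a short one (illustrative helper).
    public static string Repeat(string text, int count)
    {
        var builder = new StringBuilder(text.Length * count);
        for (int i = 0; i < count; i++) builder.Append(text);
        return builder.ToString();
    }

    public static void Main()
    {
        string small = "{\"quality\": 5}";
        string large = Repeat(small, 1000);

        // The tiny input grows because the fixed GZip overhead dominates;
        // the long, repetitive input shrinks substantially.
        Console.WriteLine($"small: {Ratio(small):F3}");
        Console.WriteLine($"large: {Ratio(large):F3}");
    }
}
```

    The exact ratios depend on the runtime's deflate implementation, so none are quoted here; the point is only that the small input comes out above 1.0 and the large one well below it.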
