My string is a JSON file (test.json) with the following content:
{
"objectId": "bbad4cc8-bce8-438e-8683-3e603d746dee",
"timestamp": "2021-04-28T14:02:42.247Z",
"variable": "temperatureArray",
"model": "abc.abcdefg.abcdef",
"quality": 5,
"value": [ 43.471600438222104, 10.00940101687303, 39.925500606152, 32.34369812176735, 33.07786476010357 ]
}
I am compressing it as follows:
using ICSharpCode.SharpZipLib.GZip;
using System;
using System.Diagnostics;
using System.IO;
using System.Reflection;
using System.Text;

namespace GZipTest
{
    public static class SharpZipLibCompression
    {
        public static void Test()
        {
            Trace.WriteLine("****************SharpZipLib Test*****************************");
            var testFile = Path.Combine(Path.GetDirectoryName(Assembly.GetExecutingAssembly().Location), "test.json");
            var text = File.ReadAllText(testFile);
            var ipStringSize = System.Text.UTF8Encoding.Unicode.GetByteCount(text);
            var compressedString = CompressString(text);
            var opStringSize = System.Text.UTF8Encoding.Unicode.GetByteCount(compressedString);
            float stringCompressionRatio = (float)opStringSize / ipStringSize;
            Trace.WriteLine("String Compression Ratio using SharpZipLib" + stringCompressionRatio);
        }

        public static string CompressString(string text)
        {
            if (string.IsNullOrEmpty(text))
                return null;
            byte[] buffer = Encoding.UTF8.GetBytes(text);
            using (var compressedStream = new MemoryStream())
            {
                GZip.Compress(new MemoryStream(buffer), compressedStream, false);
                byte[] compressedData = compressedStream.ToArray();
                return Convert.ToBase64String(compressedData);
            }
        }
    }
}
But my compressed string size (opStringSize) is more than the original string size (ipStringSize). Why?
Your benchmark has some fairly fundamental problems:
You're measuring the input size in UTF-16: UTF8Encoding.Unicode is just an unclear way of writing Encoding.Unicode, which is UTF-16. That encodes to 2 bytes per character, but most of those bytes will be 0. You're also base64-encoding the compressed output, which represents every 3 bytes of binary data as 4 characters of text, and then measuring that base64 string in UTF-16 as well. It so happens that the two uses of UTF-16 more or less cancel out, but the base64 encoding is still responsible for a lot of the discrepancy you're seeing.
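You can see the base64 expansion on its own, independent of compression (a quick throwaway check; the class name is just for illustration):

using System;

class Base64ExpansionDemo
{
    static void Main()
    {
        // Base64 represents every 3 bytes of binary input as 4 text characters.
        byte[] data = new byte[300];
        string b64 = Convert.ToBase64String(data);
        Console.WriteLine(b64.Length); // prints 400: a 4/3 expansion before any byte counting
    }
}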
Take that out, and you get a compression ratio of 0.80338985.
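If you want to reproduce that number, the fairer measurement looks something like this: count the input as UTF-8 bytes (the encoding actually fed to the compressor) and compare against the raw compressed byte length, with no base64 step. A minimal sketch, reusing your SharpZipLib GZip.Compress call; the CompressBytes and MeasureRatio names are just illustrative:

using ICSharpCode.SharpZipLib.GZip;
using System.IO;
using System.Text;

public static class CorrectedBenchmark
{
    // Compress raw bytes and return the raw compressed bytes,
    // with no base64 step inflating the output.
    public static byte[] CompressBytes(byte[] input)
    {
        using (var compressedStream = new MemoryStream())
        {
            GZip.Compress(new MemoryStream(input), compressedStream, false);
            return compressedStream.ToArray();
        }
    }

    public static float MeasureRatio(string text)
    {
        // Count the input as UTF-8 bytes, the encoding actually
        // handed to the compressor, rather than UTF-16.
        byte[] inputBytes = Encoding.UTF8.GetBytes(text);
        byte[] compressed = CompressBytes(inputBytes);
        return (float)compressed.Length / inputBytes.Length;
    }
}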
That's not bad, given that compression introduces overhead: a GZip stream always carries a fixed header and trailer (about 18 bytes between them), and that cost is there regardless of how well your data compresses. You can only really expect compression to make a significant difference on larger inputs.
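To see that fixed overhead directly, compare a trivially small input with a large repetitive one (a rough illustration; the byte counts in the comments are approximate):

using ICSharpCode.SharpZipLib.GZip;
using System;
using System.IO;
using System.Text;

public static class OverheadDemo
{
    static int CompressedLength(byte[] input)
    {
        using (var output = new MemoryStream())
        {
            GZip.Compress(new MemoryStream(input), output, false);
            return output.ToArray().Length;
        }
    }

    public static void Main()
    {
        // A tiny input: the GZip container (header, trailer, block framing)
        // outweighs any saving, so the output is bigger than the input.
        byte[] tiny = Encoding.UTF8.GetBytes("{}");
        Console.WriteLine(CompressedLength(tiny));   // roughly 20+ bytes out for 2 bytes in

        // A larger, repetitive input compresses well despite that overhead.
        byte[] large = Encoding.UTF8.GetBytes(new string('a', 10000));
        Console.WriteLine(CompressedLength(large));  // a small fraction of 10,000 bytes
    }
}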