I would like to compress a string (of any size) like "Hello My name is Bob, Im doing fine" to a smaller string like "jg3K9dlj", and be able to decompress it back. Both input and output should be strings.
I have found this code, which is easy to use: two functions, Compress() and Decompress(). Unfortunately, it gives a longer string as a result. I have also found other examples that use byte arrays, but those cannot be shown as a string (they are completely unreadable). And every time I use Convert.ToBase64String(bytes), as is also the case here, the result is longer than the original string. Thanks for any suggestions!
using System;
using System.IO;
using System.IO.Compression;
using System.Text;

public static string Compress(string uncompressedString)
{
    byte[] compressedBytes;
    using (var uncompressedStream = new MemoryStream(Encoding.UTF8.GetBytes(uncompressedString)))
    {
        using (var compressedStream = new MemoryStream())
        {
            // setting the leaveOpen parameter to true to ensure that compressedStream will not be closed when compressorStream is disposed
            // this allows compressorStream to close and flush its buffers to compressedStream and guarantees that compressedStream.ToArray() can be called afterward
            // although MSDN documentation states that ToArray() can be called on a closed MemoryStream, I don't want to rely on that very odd behavior should it ever change
            using (var compressorStream = new DeflateStream(compressedStream, CompressionLevel.Fastest, true))
            {
                uncompressedStream.CopyTo(compressorStream);
            }
            // call compressedStream.ToArray() after the enclosing DeflateStream has closed and flushed its buffer to compressedStream
            compressedBytes = compressedStream.ToArray();
        }
    }
    // Base64 turns every 3 bytes into 4 characters, so the returned text is longer than the compressed bytes
    return Convert.ToBase64String(compressedBytes);
}

public static string Decompress(string compressedString)
{
    byte[] decompressedBytes;
    // dispose the input stream as well; the original left it undisposed
    using (var compressedStream = new MemoryStream(Convert.FromBase64String(compressedString)))
    using (var decompressorStream = new DeflateStream(compressedStream, CompressionMode.Decompress))
    {
        using (var decompressedStream = new MemoryStream())
        {
            decompressorStream.CopyTo(decompressedStream);
            decompressedBytes = decompressedStream.ToArray();
        }
    }
    return Encoding.UTF8.GetString(decompressedBytes);
}
Basic theory: In order to compress without loss of information, you must find a way to represent chunks of data with other chunks that require less space. By "chunks of data" I mean any useful, appropriate units: for example bytes, double-bytes, words of English text, sequences of bits, etc. You have specified strings for both input and output, which means we have to work with characters or character groups.
This means a few things: (1) if all characters can appear randomly and with equal probability, then the string is not compressible, ever; (2) if you decide to sample the string (to establish useful groupings and frequency counts), then you must carry the result with the compressed data (an overhead), always; (3) if you will only be dealing with, for example, English text strings, then you can come up with a substitution scheme; but (4) generally useful, worthwhile compression may not be possible.
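To put rough numbers on the overhead, here is a minimal sketch (the class and method names are hypothetical, not from the question's code). It computes how many characters Base64 needs for a given number of bytes, and measures how many bytes Deflate produces for a given string, so the two effects can be looked at separately.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Text;

// Hypothetical helper for illustration only.
public static class SizeCheck
{
    // Base64 emits 4 characters for every group of up to 3 bytes (padding included).
    public static int Base64Length(int byteCount) => 4 * ((byteCount + 2) / 3);

    // Number of bytes Deflate produces for the UTF-8 encoding of the text.
    public static int DeflatedByteCount(string text)
    {
        byte[] input = Encoding.UTF8.GetBytes(text);
        using (var output = new MemoryStream())
        {
            // leaveOpen: true so that output can still be queried after the DeflateStream is disposed
            using (var deflate = new DeflateStream(output, CompressionLevel.Optimal, true))
            {
                deflate.Write(input, 0, input.Length);
            }
            return (int)output.Length;
        }
    }
}
```

SizeCheck.Base64Length(35) is 48, so a 35-byte payload already becomes 48 characters once Base64-encoded, before compression has had any chance to help. Deflate would have to shrink the example sentence to 24 bytes or fewer for the Base64 text to come out shorter than 35 characters, which is unlikely for such a short input with little repetition.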
A primitive substitution scheme might go something like this (using a low-frequency character to signal a substitution; I have chosen # for this example):
## for # (*expands* to 2 characters in compressed string)
#1 for high-frequency word#1 e.g. "Hello"
#2 "doing", #3 "fine", #4 "name"
etc
Then you can get "#1 My #4 is Bob, Im #2 #3" (25 chars, about 29% saved) for "Hello My name is Bob, Im doing fine" (35 chars). Tricks like defining that a substitution is always followed by a space, unless punctuation follows or the space would be meaningless (at the end of the string), allow you to reduce this further to "#1My #4is Bob, Im #2#3" (22 chars, 37% saved).
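The simpler variant of the scheme (without the implied-space trick) can be sketched as follows. This is only an illustration under stated assumptions: the class name and the four-word dictionary are hypothetical, the dictionary is hard-coded, it holds at most 9 entries so that every token is exactly two characters, and no dictionary word contains '#'.

```csharp
using System;
using System.Text;

// Hypothetical illustration of the "#n" substitution scheme.
public static class SubstitutionCodec
{
    // Assumed dictionary: "#1" = Hello, "#2" = doing, "#3" = fine, "#4" = name.
    // At most 9 entries, so a token is always '#' plus one digit.
    private static readonly string[] Words = { "Hello", "doing", "fine", "name" };

    public static string Compress(string text)
    {
        // Escape literal '#' first so it cannot be mistaken for a token later.
        var result = new StringBuilder(text.Replace("#", "##"));
        for (int i = 0; i < Words.Length; i++)
        {
            result.Replace(Words[i], "#" + (i + 1));
        }
        return result.ToString();
    }

    public static string Decompress(string compressed)
    {
        var result = new StringBuilder();
        for (int i = 0; i < compressed.Length; i++)
        {
            char c = compressed[i];
            if (c != '#')
            {
                result.Append(c);
                continue;
            }
            if (i + 1 >= compressed.Length)
                throw new FormatException("Dangling '#' at end of input.");
            char next = compressed[++i];
            if (next == '#')
            {
                result.Append('#'); // "##" is an escaped literal '#'
            }
            else
            {
                int index = next - '1'; // '1' maps to Words[0]
                if (index < 0 || index >= Words.Length)
                    throw new FormatException("Unknown token #" + next);
                result.Append(Words[index]);
            }
        }
        return result.ToString();
    }
}
```

With this dictionary, SubstitutionCodec.Compress("Hello My name is Bob, Im doing fine") gives "#1 My #4 is Bob, Im #2 #3", and Decompress restores the original. Note that any text without dictionary words is returned unchanged or, if it contains '#', slightly longer.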
There is a reason you don't see this sort of thing done much in the wild (and why it may not be worth your while to do at all unless you have a very specific use-case and a set of simple constraints). Consider how you'd compress "doing 3 things" with my scheme above: "#23 things", right? But what if your substitution dictionary has more than 22 entries? Have you encoded word #23, or word #2 followed by '3'? To accommodate this case you have to give something up (add complexity and probably lose a bit of compressibility). So I am sure you can see that making a bulletproof, general-purpose and worthwhile scheme is only feasible under strict, limited circumstances, and even then it will take careful thought.
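One possible way to remove that ambiguity, at the cost of a longer token, is to give every token a fixed width. The sketch below is an assumption for illustration, not part of the scheme above: it limits the dictionary to 99 entries, always writes two digits, and leaves out the handling of the "##" escape.

```csharp
using System;
using System.Globalization;

// Hypothetical fixed-width token helper.
public static class FixedWidthTokens
{
    // Assumption: at most 99 dictionary entries, so two digits always suffice.
    public static string Encode(int wordNumber)
    {
        if (wordNumber < 1 || wordNumber > 99)
            throw new ArgumentOutOfRangeException(nameof(wordNumber));
        return "#" + wordNumber.ToString("D2", CultureInfo.InvariantCulture);
    }

    // Reads the three-character token that starts at position start
    // and returns the word number it refers to.
    public static int Decode(string text, int start)
    {
        if (start < 0 || start + 3 > text.Length || text[start] != '#')
            throw new FormatException("Expected a three-character token.");
        return int.Parse(text.Substring(start + 1, 2), NumberStyles.None, CultureInfo.InvariantCulture);
    }
}
```

Word #2 followed by a literal '3' is now written "#023", while word #23 is written "#23", so the decoder can always tell them apart. The price is that every token costs three characters instead of two, which is exactly the kind of compressibility you give up.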
Remember 1: The law of diminishing returns: how much can you save, set against the additional cost (complexity) necessary to achieve that saving?
Remember 2: The substitution mapping must either be hard-coded, saved in configuration or must be carried with compressed data!
Having said all of that, if you want to save space on disk and know that your strings will contain only characters from the ASCII character set, you might halve the space requirement by storing them as ASCII (one byte per character) rather than in the UTF-16 form that C# uses for strings in memory (two bytes per character for these characters). Take care to ensure that this is how they are actually written to disk by specifying the encoding when writing the file, and use the same encoding when reading it back. This can be done in addition to compression by substitution, and of the two it probably offers the larger space saving.
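The difference is easy to check with the built-in encodings. A minimal sketch (the class name is hypothetical) that compares the byte counts and guards against non-ASCII input, since Encoding.ASCII silently replaces such characters with '?':

```csharp
using System;
using System.Text;

// Hypothetical helper for comparing on-disk sizes.
public static class EncodingSizes
{
    // Bytes needed when the string is stored as UTF-16 (Encoding.Unicode).
    public static int Utf16Bytes(string s) => Encoding.Unicode.GetByteCount(s);

    // Bytes needed when the string is stored as ASCII.
    public static int AsciiBytes(string s) => Encoding.ASCII.GetByteCount(s);

    // True only if every character fits in 7-bit ASCII, i.e. the re-encoding is lossless.
    public static bool IsAscii(string s)
    {
        foreach (char c in s)
        {
            if (c > 127) return false;
        }
        return true;
    }
}
```

For "Hello My name is Bob, Im doing fine" this gives 70 bytes as UTF-16 and 35 bytes as ASCII. When writing the file, the encoding can be passed explicitly, for example File.WriteAllText(path, text, Encoding.ASCII), and the same encoding should be passed when reading.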