.net text encoding radix

Making a custom string encoder in .NET


I know .NET supports base64 encoding of byte arrays, but I thought I could save even more space by using a larger set of characters. I read somewhere that Unicode supports thousands of different characters, so why not use base1024 encoding, for example? If this is possible, can you give some guidelines on how to implement it? Thanks.


Solution

  • Depending on whether you use a 2-byte Unicode encoding (UCS-2) or a multi-byte one (UTF-8), base1024 would be at best only slightly better than base64, and possibly more wasteful of space. Base64 stores 6 bits of data in each 8-bit character (assuming one byte per character, as in ASCII or UTF-8), so raw binary data converted to base64 grows to 4/3 of its original size (about 1.333x).

    But base1024 using UCS-2 (16-bit) Unicode characters would use only 10 of every 16 bits, so it would take 8/5 the space: raw binary data converted to base1024 in UCS-2 would grow to 1.6 times its original size. This is worse than base64.

    If you used UTF-8 instead, and were careful to use only Unicode characters that have 1- or 2-byte encodings, you would gain at most 1920 additional code points (U+0080 through U+07FF) from the 2-byte sequences, which works out to only a slight improvement in data density. (UTF-8 uses only 6 bits of each additional 8-bit byte for the code point; the other 2 bits mark the byte as a continuation of a multi-byte sequence.)

    So this is not going to help. You should look into compressing your data before converting it to base64.
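
The size arithmetic above can be checked with a short script. This is only a sketch, written in Python for brevity rather than in .NET, and it rests on some assumptions that are not part of the answer: the `base1024_encode` function, its alphabet (the 1024 consecutive code points U+0400 to U+07FF, each 2 bytes long in both UTF-16 and UTF-8), and the sample data are all made up for illustration. It is not a proposed standard encoding, and it has no decoder.

```python
import base64
import zlib

# Hypothetical alphabet for illustration: U+0400..U+07FF is 1024 consecutive
# code points, each 2 bytes long in both UTF-16 and UTF-8.
ALPHABET_START = 0x0400
BITS_PER_CHAR = 10


def base1024_encode(data: bytes) -> str:
    """Pack the input into 10-bit groups and map each group to one character."""
    total_bits = len(data) * 8
    pad = (-total_bits) % BITS_PER_CHAR  # zero bits appended to fill the last group
    value = int.from_bytes(data, "big") << pad
    count = (total_bits + pad) // BITS_PER_CHAR
    return "".join(
        chr(ALPHABET_START + ((value >> (BITS_PER_CHAR * i)) & 0x3FF))
        for i in reversed(range(count))
    )


def growth(encoded_size: int, raw_size: int) -> float:
    """Encoded size divided by raw size (1.0 means no growth)."""
    return encoded_size / raw_size


raw = b"hello world " * 100  # 1200 bytes of repetitive sample data

b64_size = len(base64.b64encode(raw))       # ASCII output, 1 byte per character
b1024 = base1024_encode(raw)
ucs2_size = len(b1024.encode("utf-16-le"))  # 2 bytes per character
utf8_size = len(b1024.encode("utf-8"))      # also 2 bytes per character for this alphabet

# Compress first, then base64-encode the compressed bytes.
packed_size = len(base64.b64encode(zlib.compress(raw)))

print(f"base64:              {growth(b64_size, len(raw)):.3f}x")
print(f"base1024 as UCS-2:   {growth(ucs2_size, len(raw)):.3f}x")
print(f"base1024 as UTF-8:   {growth(utf8_size, len(raw)):.3f}x")
print(f"zlib, then base64:   {growth(packed_size, len(raw)):.3f}x")
```

For this 1200-byte sample, base64 produces 1600 bytes and the base1024 sketch produces 1920 bytes in either encoding, matching the 4/3 and 8/5 ratios. How much compression saves depends entirely on the data: highly repetitive input like this sample shrinks a lot, while already-compressed or random input may not shrink at all.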