Different results after encoding/decoding base64


I have the following base64 string:

R1NNQiBBZ2VuY3kgR21iSCAvIFdlYmRlc2lnbiBBZ2VudHVyIFVsbSAvIE9ubGluZXNob3AgQWdlbnR1ciAvIEFwcCBBZ2VudHVyIFVsbSwgR2VybWFueS==

And using an online base64 decoder I get the following result:

GSMB Agency GmbH / Webdesign Agentur Ulm / Onlineshop Agentur / App Agentur Ulm, Germany

All good, right? But if I convert this text back to base64, the result becomes:

R1NNQiBBZ2VuY3kgR21iSCAvIFdlYmRlc2lnbiBBZ2VudHVyIFVsbSAvIE9ubGluZXNob3AgQWdlbnR1ciAvIEFwcCBBZ2VudHVyIFVsbSwgR2VybWFueQ==

Any ideas?

This is the C# code I am using for decoding:

string basestring = "R1NNQiBBZ2VuY3kgR21iSCAvIFdlYmRlc2lnbiBBZ2VudHVyIFVsbSAvIE9ubGluZXNob3AgQWdlbnR1ciAvIEFwcCBBZ2VudHVyIFVsbSwgR2VybWFueS==";

string output = Encoding.UTF8.GetString(Convert.FromBase64String(basestring));

return output;

And here's the encoding part

string basestring = "GSMB Agency GmbH / Webdesign Agentur Ulm / Onlineshop Agentur / App Agentur Ulm, Germany";

string output = Convert.ToBase64String(Encoding.UTF8.GetBytes(basestring));

return output;

Solution

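  • This is an artefact of regrouping 8-bit bytes (here, the UTF-8 bytes of your string) into the 6-bit groups that Base64 uses.
    For reference, the Base64 encoding table maps each 6-bit value to one character: A-Z are 0-25, a-z are 26-51, 0-9 are 52-61, and + and / are 62 and 63.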

    Take the string "AB" as an example. A and B are character codes 65 and 66, which in 8-bit binary are 01000001 and 01000010.

    Encoding

    When encoding to Base64, the same bits of your string are split into groups of 6 instead of 8. So the 16-bit sequence above is split into 010000/010100/0010 (same bit pattern, just grouped differently).

    The first two groups are easy. Look them up in the encoding table and you'll see that 010000 = Q and 010100 = U. The last group has only 4 bits instead of the expected 6. This is where things get interesting.

    When encoding, the end is usually padded with zeroes to get to 6 bits. So your 0010 becomes 001000, which is I, and "AB" encoded in Base64 becomes "QUI=". The = carries no data; it only pads the output to a multiple of 4 characters. Some decoders accept input without it, but .NET's Convert.FromBase64String requires it.
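
    The encoding steps above can be sketched by hand in C#. This is only an illustration of the regrouping, not how Convert.ToBase64String is implemented; the Alphabet constant and the variable names are my own, and the last line compares the result against the built-in encoder.

    ```csharp
    using System;
    using System.Linq;
    using System.Text;

    // Base64 alphabet (RFC 4648); a character's index is its 6-bit value.
    const string Alphabet =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

    // "AB" as one continuous bit string: 0100000101000010
    string bits = string.Concat(
        Encoding.UTF8.GetBytes("AB").Select(b => Convert.ToString(b, 2).PadLeft(8, '0')));

    // Zero-pad on the right to a multiple of 6 bits: 010000010100001000
    string padded = bits.PadRight((bits.Length + 5) / 6 * 6, '0');

    // Map each 6-bit group to its character in the alphabet.
    string encoded = string.Concat(
        Enumerable.Range(0, padded.Length / 6)
            .Select(i => Alphabet[Convert.ToInt32(padded.Substring(i * 6, 6), 2)]));

    // Append '=' until the length is a multiple of 4.
    encoded = encoded.PadRight((encoded.Length + 3) / 4 * 4, '=');

    Console.WriteLine(encoded);                                              // QUI=
    Console.WriteLine(Convert.ToBase64String(Encoding.UTF8.GetBytes("AB"))); // QUI=
    ```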

    Decoding

    Remember how your last group of 0010 was padded to become 6 bits? Here's the fun part: the padding bits don't have to be zeroes. The 16 bits (2x8) in your original string became 18 bits (3x6) because of the padding. Since 18 is not a multiple of 8, the decoder knows to drop the 2 excess bits. So the two padding bits can be anything, and a lenient decoder will still produce the same bytes.

    0010 when padded could be 001000, 001001, 001010, or 001011, which translate to I, J, K, or L. Bring up a decoder and try decoding QUI, QUJ, QUK, and QUL (append = if your decoder requires padding, as .NET's does). They will all decode to "AB", unless the decoder is a strict one that rejects non-zero padding bits.
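
    Here is a quick way to check this in C#, as a sketch. It assumes Convert.FromBase64String ignores the unused trailing bits, which is the behaviour your own decoding code relies on; RFC 4648 allows a strict decoder to reject such input instead. The = is appended because .NET requires the length to be a multiple of 4.

    ```csharp
    using System;
    using System.Linq;
    using System.Text;

    // I, J, K and L differ only in their last two bits, which are padding.
    string[] candidates = { "QUI=", "QUJ=", "QUK=", "QUL=" };

    // Decode each candidate back to text.
    string[] decoded = candidates
        .Select(c => Encoding.UTF8.GetString(Convert.FromBase64String(c)))
        .ToArray();

    for (int i = 0; i < candidates.Length; i++)
    {
        Console.WriteLine($"{candidates[i]} -> {decoded[i]}");
    }
    ```

    With a lenient decoder, every line should show AB on the right-hand side.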

    Your string

    Now, your string, when split into 6-bit groups, looks like the following (see fiddle):

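    using System;
    using System.Linq;
    using System.Text;

    var basestring = "GSMB Agency GmbH / Webdesign Agentur Ulm / Onlineshop Agentur / App Agentur Ulm, Germany";
    // Render every UTF-8 byte as 8 binary digits, then regroup the digits in sixes.
    var sixBitGroups = Encoding.UTF8.GetBytes(basestring)
      .SelectMany(b => Convert.ToString(b, 2).PadLeft(8, '0'))
      .Chunk(6) // Chunk requires .NET 6 or later
      .Select(c => new string(c));
    Console.WriteLine(string.Join("/", sixBitGroups));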
    

    You'll notice that it ends with ../01. That 01 needs to be padded with 4 extra bits. Again, they're usually zeroes, making it 010000, which is Q. So your re-encoded string ends with ..FueQ==. But since the padding bits don't have to be all zeroes, the table shows that 01xxxx covers the 16 characters from Q to f (Q, R, S, ..., Z, a, ..., f). This explains why your base64 ..FueS== still decodes to the exact same string.
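
    If you need to tell whether two Base64 strings carry the same data, one option is to compare the decoded bytes rather than the text, or to re-encode so that both sides are in canonical form. A minimal sketch, again assuming Convert.FromBase64String accepts the non-zero padding bits (the variable names are mine):

    ```csharp
    using System;
    using System.Linq;

    string original  = "R1NNQiBBZ2VuY3kgR21iSCAvIFdlYmRlc2lnbiBBZ2VudHVyIFVsbSAvIE9ubGluZXNob3AgQWdlbnR1ciAvIEFwcCBBZ2VudHVyIFVsbSwgR2VybWFueS==";
    string reencoded = "R1NNQiBBZ2VuY3kgR21iSCAvIFdlYmRlc2lnbiBBZ2VudHVyIFVsbSAvIE9ubGluZXNob3AgQWdlbnR1ciAvIEFwcCBBZ2VudHVyIFVsbSwgR2VybWFueQ==";

    byte[] a = Convert.FromBase64String(original);
    byte[] b = Convert.FromBase64String(reencoded);

    // The texts differ in one character (S vs Q)...
    bool sameText = original == reencoded;

    // ...but the decoded bytes are expected to be identical.
    bool sameBytes = a.SequenceEqual(b);

    // Re-encoding the decoded bytes yields the canonical form (zero padding bits).
    string canonical = Convert.ToBase64String(a);

    Console.WriteLine($"same text:  {sameText}");
    Console.WriteLine($"same bytes: {sameBytes}");
    Console.WriteLine($"canonical matches re-encoded: {canonical == reencoded}");
    ```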