javascript json encoding base64 buffer

Recovering the original JSON string after it was incorrectly decoded with Buffer.from(..., "base64")


I uploaded some data that was incorrectly encoded. I took a JSON string (data) and passed it to Buffer.from(data, "base64"), so Buffer.from expected a Base64 string but received plain UTF-8 JSON text. The resulting buffer was then uploaded to S3. When I downloaded the files, I expected JSON strings and instead got output that looked like this:

�ǫ�.�Ɲ�)��ę����h�*��a���r��w(c˰�� W�G}��n�^�'2��V�߃r)���H�ץ1�I}t�^�
i��l�����ߢ�L��G�y�(
����m4�&�j���;뫬z���x���\�^�

Code to recreate the error:

const data = {
  userGroups: '["admin"]',
  userEmail: "[email protected]",
  id: "yhjy7C5pCBX_kd9xckVbt",
  device:
    "Mozilla/3.4 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/533.36 (KHTML, like Gecko) Chrome/152.0.0.0 Safari/5647.66",
  userCognitoId: "c75de28",
};

const recreateFileCorruption = () => {
  const jsonPrettyString = JSON.stringify(data);

  // The bug: the JSON text is treated as if it were Base64 (Buffer.from needs no "new").
  const body = Buffer.from(jsonPrettyString, "base64");
  return body;
};

const runTest = () => {
  const data = recreateFileCorruption();
  console.log("Broken data string:", data.toString());
  console.log("Somewhat decoded string", Buffer.from(data.toString("base64"), "utf-8").toString());
};

Running the runTest function outputs this to the terminal:

Broken data string: �ǫ�.�Ɲ�)��ę����h�*��a���r��w(c˰�� W�G}��n�^�'2��V�߃r)���H�ץ1�I}t�^�
i��l�����ߢ�L��G�y�(
����m4�&�j���;뫬z���x���\�^�

Is there any way to recover the original JSON string? Or is that data lost forever?


Solution

  • It's lost.

    I checked the length with

    console.log(JSON.stringify(data).length);
    

    and got 249 characters. If all 249 were Base64 characters, each encoding 6 bits, you would expect a resulting buffer of 249*6/8 = 186.75, i.e. 186 whole bytes.
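
The arithmetic can be sketched as follows. `decodedByteLength` is a hypothetical helper, not part of Node.js, and it assumes unpadded input in which every character belongs to the Base64 alphabet.

```javascript
// Each Base64 character carries 6 bits; leftover bits that do not
// fill a whole byte are dropped, hence the Math.floor.
const decodedByteLength = (base64CharCount) =>
  Math.floor((base64CharCount * 6) / 8);

console.log(decodedByteLength(249)); // 186
console.log(decodedByteLength(186)); // 139
```

The second call covers the case where only 186 of the 249 characters are valid Base64.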

    But the actual size of the data buffer after this

    const data = recreateFileCorruption();
    console.log(data.length)
    

    is only 139 bytes. That is because

    Buffer.from(jsonPrettyString, "base64");
    

    just skips every non-Base64 character, and a JSON string is full of them, e.g. {}":,[] and others. Buffer.from is even lenient enough to accept both the standard Base64 alphabet and the Base64URL variant ("-" in place of "+" and "_" in place of "/").

    It should be no surprise that the non-Base64 characters are skipped: there is no 6-bit value associated with them, so a decoder cannot handle them. The only alternative would be to raise an exception.

    I also removed all non-Base64 characters from the string manually and got a length of 186 characters, which decodes to a Buffer of 139 bytes (186*6/8 = 139.5, rounded down). That is the same size we saw above.
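
A minimal sketch of that manual check, using a made-up sample object rather than the data from the question. `keepBase64Chars` is a hypothetical helper, and the sample deliberately contains no "=" character, which Base64 treats as padding.

```javascript
// Hypothetical sample; not the data from the question.
const sample = JSON.stringify({ id: "x_1.8", tags: ["a-b", "c+d"] });

// Keep only characters from the Base64 and Base64URL alphabets.
const keepBase64Chars = (text) => text.replace(/[^A-Za-z0-9+\/_-]/g, "");

const stripped = keepBase64Chars(sample);
const fromOriginal = Buffer.from(sample, "base64");
const fromStripped = Buffer.from(stripped, "base64");

// Character counts before and after stripping, then the decoded size.
console.log(sample.length, stripped.length, fromOriginal.length);

// Decoding the raw JSON text and decoding the stripped text give the same bytes.
console.log(fromOriginal.equals(fromStripped)); // true

// The Base64URL characters decode to the same values as "+" and "/".
console.log(Buffer.from("-_", "base64").equals(Buffer.from("+/", "base64"))); // true
```

If the two buffers are equal, the decoder really did ignore every character outside the alphabet.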

    So

     console.log("Somewhat decoded string", Buffer.from(data.toString("base64"), "utf-8").toString());
    

    Somewhat decoded string userGroupsadminuserEmailhodoriron+thronecomidyhjy7C5pCBX/kd9xckVbtdeviceMozilla/34MacintoshIntelMacOSX10/15/7AppleWebKit/53336KHTMLlikeGeckoChrome/152000Safari/564766userCognitoIdc75de2w==

    is the best you can get. The skipped punctuation is gone and the buffer holds no record of where it was, so there is nothing else to restore at this point.
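
To illustrate why the loss cannot be undone, here is a small sketch with two made-up inputs (not the data from the question) that contain the same Base64-alphabet characters in the same order. `corrupt` is a hypothetical stand-in for the faulty upload step.

```javascript
// Stand-in for the faulty step: JSON text decoded as if it were Base64.
const corrupt = (text) => Buffer.from(text, "base64");

// Two different hypothetical documents whose Base64-alphabet characters
// are identical and in the same order: i d x _ 1 8
const first = '{"id":"x_1.8"}';
const second = '["i","d","x_1",8]';

// Both collapse to the same bytes, so the buffer cannot tell them apart.
console.log(corrupt(first).equals(corrupt(second))); // true

// Re-encoding returns only the surviving characters, slightly altered.
console.log(corrupt(first).toString("base64")); // idx/1w==
```

The re-encoded text shows the same effects as the output above: "_" comes back as "/", and the final character changes ("8" becomes "w" followed by "==" padding) because its trailing bits did not fill a whole byte.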