I uploaded some data that was incorrectly encoded. I took a JSON string (data) and passed it into Buffer.from(data, "base64"). So Buffer.from expected a Base64 string but received a UTF-8 string, and that buffer was then uploaded to S3.
When I downloaded the JSON files I was expecting a JSON string and instead got stuff that looked like this:
�ǫ�.�Ɲ�)��ę����h�*��a���r��w(c˰�� W�G}��n�^�'2��V�߃r)���H�ץ1�I}t�^�
i��l�����ߢ�L��G�y�(
����m4�&�j���;뫬z���x���\�^�
Code to recreate the error:
const data = {
  userGroups: '["admin"]',
  userEmail: "hodor@iron-throne.com",
  id: "yhjy7C5pCBX_kd9xckVbt",
  device:
    "Mozilla/3.4 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/533.36 (KHTML, like Gecko) Chrome/152.0.0.0 Safari/5647.66",
  userCognitoId: "c75de28",
};
const recreateFileCorruption = () => {
  const jsonPrettyString = JSON.stringify(data);
  const body = new Buffer.from(jsonPrettyString, "base64");
  return body;
};
const runTest = () => {
  const data = recreateFileCorruption();
  console.log("Broken data string:", data.toString());
  console.log("Somewhat decoded string", Buffer.from(data.toString("base64"), "utf-8").toString());
};
Running the runTest function outputs this to the terminal:
Broken data string: �ǫ�.�Ɲ�)��ę����h�*��a���r��w(c˰�� W�G}��n�^�'2��V�߃r)���H�ץ1�I}t�^�
i��l�����ߢ�L��G�y�(
����m4�&�j���;뫬z���x���\�^�
Is there any way to recover the original JSON string? Or is that data lost forever?
It's lost.
I checked the length with
console.log(JSON.stringify(data).length);
and got 249 characters. If all 249 were Base64 characters, each encoding 6 bits, you would expect a resulting buffer of 249*6/8 = 186.75, i.e. 186 whole bytes.
But the actual size of the data buffer after this
const data = recreateFileCorruption();
console.log(data.length)
is only 139 bytes. That is because
new Buffer.from(jsonPrettyString, "base64");
just skips every non-Base64 character, and a JSON string is full of them, e.g. {}":,[] and others. Buffer.from is even lenient enough to accept both the standard Base64 alphabet and the Base64URL variant ("-" as a substitute for "+" and "_" for "/").
It should be no surprise that the non-Base64 characters are skipped: there is no 6-bit value associated with them, so a decoder cannot map them to any data. The only alternative here would be to raise an exception.
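The skipping is easy to check in isolation. The following is a minimal sketch with made-up sample strings (not the data from the question), assuming the lenient decoding of Node.js's Buffer.from described above:

```javascript
// Sample inputs are made up for illustration.
// Characters outside the Base64 alphabet ({ } " :) are skipped,
// so only "SGVs" + "bG8h" is actually decoded.
const fromJsonLike = Buffer.from('{"SGVs":"bG8h"}', "base64");
const fromClean = Buffer.from("SGVsbG8h", "base64");

console.log(fromJsonLike.equals(fromClean)); // true
console.log(fromJsonLike.toString("utf-8")); // Hello!

// The Base64URL characters "-" and "_" decode like "+" and "/".
const urlVariant = Buffer.from("a-b_", "base64");
const standard = Buffer.from("a+b/", "base64");
console.log(urlVariant.equals(standard)); // true
```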
I also removed all non-Base64 characters manually from the string and got a length of 186 characters, which after decoding results in a Buffer of 186*6/8 = 139.5, i.e. 139 whole bytes. That is the same size we saw above.
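The same count can be done programmatically. This is only a sketch: the helper name predictDecodedLength and the sample string are hypothetical, and it assumes the decoder keeps exactly the characters A-Z, a-z, 0-9, "+", "/", "-" and "_" and that the text contains no "=" padding.

```javascript
// Hypothetical helper: predicts how many bytes Buffer.from(text, "base64")
// yields, assuming every character outside the Base64/Base64URL alphabet
// is skipped and the text contains no "=" padding.
const predictDecodedLength = (text) => {
  const kept = text.replace(/[^A-Za-z0-9+\/_-]/g, "");
  // Each kept character carries 6 bits; an incomplete trailing byte is dropped.
  return Math.floor((kept.length * 6) / 8);
};

// Made-up sample, not the data from the question.
const sample = '{"id":"abc_123"}';
console.log(predictDecodedLength(sample)); // 6
console.log(Buffer.from(sample, "base64").length); // 6
```

Applied to the 249-character JSON string from the question, the same calculation should reproduce the 186 kept characters and the 139 bytes.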
So
console.log("Somewhat decoded string", Buffer.from(data.toString("base64"), "utf-8").toString());
Somewhat decoded string userGroupsadminuserEmailhodoriron+thronecomidyhjy7C5pCBX/kd9xckVbtdeviceMozilla/34MacintoshIntelMacOSX10/15/7AppleWebKit/53336KHTMLlikeGeckoChrome/152000Safari/564766userCognitoIdc75de2w==
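is the best you can get; there's nothing else to restore at this point. The skipped characters (quotes, braces, colons, spaces, dots and so on) left no trace in the buffer, "-" and "_" can no longer be told apart from "+" and "/", and the final incomplete byte was dropped, which is why the output ends in c75de2w== instead of c75de28.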
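To illustrate why the loss is irreversible, here is a small sketch with made-up strings, again assuming the skipping behaviour described above. Different inputs collapse into identical bytes, so the buffer alone cannot tell you which one it came from:

```javascript
// Two different made-up JSON texts that share the same Base64 characters.
const first = Buffer.from('{"ab":"cd"}', "base64");
const second = Buffer.from('["a","b c d"]', "base64");

// Both collapse to the bytes of "abcd"; punctuation and spaces are gone.
console.log(first.equals(second)); // true
console.log(first.toString("base64")); // abcd

// "-" and "+" are also indistinguishable after decoding.
const dash = Buffer.from("iron-throne", "base64");
const plus = Buffer.from("iron+throne", "base64");
console.log(dash.equals(plus)); // true
```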