
Base64 Decode embedded PDF in Typescript


Within an XML file we have a base64-encoded string representing a PDF file that contains some table representations, similar to this example. When decoding the base64 string of that PDF document (such as this one), we end up with a PDF document of 66 kB in size, which opens correctly in any PDF viewer.

When trying to decode that same base64-encoded string with Buffer in TypeScript (within a VS Code extension), using the code below:

function decodeBase64(base64String: string): string {
    const buf: Buffer = Buffer.from(base64String, "base64");
    return buf.toString();
}

// the base64 encoded string is usually extracted from an XML file directly
// for testing purposes we load that base64 encoded string from a local file
const base64Enc: string = fs.readFileSync(".../base64Enc.txt", "ascii");
const base64Decoded: string = decodeBase64(base64Enc);

fs.writeFileSync(".../table.pdf", base64Decoded);

we end up with a PDF of 109 kB in size and a document that can't be opened using PDF viewers.

For a simple PDF, such as this one, with a base64 encoded string representation like this, the code above works and the PDF can be read in any PDF viewer.

I've also tried to directly read in the locally stored base64 encoded representation of the PDF file using

const buffer: string | Buffer = fs.readFileSync(".../base64Enc.txt", "base64");

though that doesn't produce anything useful either.

I also tried a slight adaptation of this suggestion. Since atob(...) is not available (other answers suggest replacing atob with Buffer), I ended up with code like this:

const buffer: string = fs.readFileSync(".../base64Enc.txt", "ascii");

// atob(...) is not present, other answers suggest to use Buffer for conversion
const binary: string = Buffer.from(buffer, 'base64').toString();
const arrayBuffer: ArrayBuffer = new ArrayBuffer(binary.length);
const uintArray: Uint8Array = new Uint8Array(arrayBuffer);

for (let i: number = 0; i < binary.length; i++) {
    uintArray[i] = binary.charCodeAt(i);
}
const decoded: string = Buffer.from(uintArray.buffer).toString();

fs.writeFileSync(".../table.pdf", decoded);

This doesn't produce a readable PDF either. The "decoded" table.pdf sample again ends up at 109 kB in size.

What am I doing wrong here? How can I decode a PDF such as the table.pdf sample to obtain a readable PDF document, similar to the functionality provided by Notepad++?


Solution

  • Borrowing heavily from answers to "How to get an array from ArrayBuffer?", you can get a Uint8Array straight from the Buffer using the Uint8Array constructor and write it to the file without ever converting it to a string:

    const buffer: string = fs.readFileSync(".../base64Enc.txt", "ascii");
    const uintArray: Uint8Array = new Uint8Array(Buffer.from(buffer, 'base64'));
    fs.writeFileSync(".../table.pdf", uintArray);
    

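    Writing the Uint8Array directly to the file guarantees there's no corruption due to encoding changes from moving to and from strings. In the question's code, buf.toString() decodes the binary data as UTF-8 by default, so every byte sequence that is not valid UTF-8 is replaced with U+FFFD, which takes three bytes when the string is written back out. That is most likely why the broken file grows from 66 kB to 109 kB.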

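    Just a note: passing a Buffer to the Uint8Array constructor copies the bytes, so the Uint8Array does not share memory with the Buffer. Not that it matters in this case, since this code doesn't reference the Buffer outside of the constructor, but it is worth knowing in case someone decides to create a new variable for the output of Buffer.from(buffer, 'base64'). Since a Buffer is itself a Uint8Array, it can also be passed to fs.writeFileSync directly, which avoids the extra copy.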
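The corruption described above can be reproduced without any PDF at hand. The following is only a minimal sketch: the helper name decodeBase64ToBytes and the sample bytes are made up for illustration and are not part of the original question. It compares a round trip through a string with keeping the decoded data as bytes.

```typescript
import { Buffer } from "node:buffer";

// Hypothetical helper (not from the question): decode Base64 into raw bytes
// and never turn the result into a string.
function decodeBase64ToBytes(base64String: string): Buffer {
    return Buffer.from(base64String, "base64");
}

// Made-up sample data: "%PDF-" followed by four bytes that are not valid UTF-8.
// Real PDFs contain plenty of such bytes in their binary streams.
const original: Buffer = Buffer.from([0x25, 0x50, 0x44, 0x46, 0x2d, 0xff, 0xfe, 0xfd, 0xfc]);
const base64Enc: string = original.toString("base64");

// Lossy path, as in the question: toString() decodes as UTF-8 by default,
// so the invalid bytes are replaced before the data is encoded again.
const viaString: Buffer = Buffer.from(decodeBase64ToBytes(base64Enc).toString());

// Lossless path: the decoded bytes are kept as they are.
const viaBytes: Buffer = decodeBase64ToBytes(base64Enc);

console.log("original bytes:     ", original.length);
console.log("via string bytes:   ", viaString.length);
console.log("bytes path intact:  ", viaBytes.equals(original));
console.log("string path intact: ", viaString.equals(original));
```

With real data, viaBytes is what would be handed to fs.writeFileSync(".../table.pdf", viaBytes); the string path is shown only to make the size growth and the byte mismatch visible.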