javascript, node.js, xml, openxml, docx

NodeJS Merge Docx Files Using Buffer.concat() gives "word found unreadable content"


I have a simple a.docx file that contains only the word "a" (a minimal example to make the point). I read the file with fs.readFileSync, which returns a Buffer. My goal is to append another buffer to it (from another .docx file) and end up with a merged .docx file.

Here is what I did to test:

const fs = require("fs");
const path = require("path");

var buf = fs.readFileSync(path.resolve(__dirname, "a.docx"));
console.log(buf);
console.log(Buffer.concat([buf, buf]));

Gives:

<Buffer 50 4b 03 04 0a 00 00 00 00 00 00 00 21 00 df a4 d2 6c 20 05 00 00 20 05 00 00 13 00 00 00 5b 43 6f 6e 74 65 6e 74 5f 54 79 70 65 73 5d 2e 78 6d 6c 3c ... 51319 more bytes>
<Buffer 50 4b 03 04 0a 00 00 00 00 00 00 00 21 00 df a4 d2 6c 20 05 00 00 20 05 00 00 13 00 00 00 5b 43 6f 6e 74 65 6e 74 5f 54 79 70 65 73 5d 2e 78 6d 6c 3c ... 102688 more bytes>

As expected (double the bytes). However, when I save the new concatenated buffer with the following:

fs.writeFileSync(
  path.resolve(__dirname, "hello.docx"),
  Buffer.concat([buf, buf])
);

... and try to open the resulting hello.docx file, I get Word's "found unreadable content" error dialog.

If I click "Yes", the document simply displays the original a.docx file without the expected duplication - why is this happening?


Solution

  • Concatenating a DOCX file to itself at the byte level simply doesn't produce a viable DOCX file.

    You've entirely disregarded multiple levels of abstraction, specified in thousands of pages of standards documents: a DOCX file is a ZIP package of XML parts, not a flat stream of bytes that can be extended by appending. Read up on OOXML and WordprocessingML. Minimally, you'll want to use ZIP and XML processing libraries. More likely, you'll want to find a higher-level DOCX/OOXML library to help.

    Word just happened to be able to repair your mangled "DOCX" file to show you its original form. That speaks more to the robustness of Word's repair capabilities than it does to the file being a proper DOCX document.
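
To make the ZIP layer concrete, here is a rough sketch that inspects the "end of central directory" (EOCD) record, which ZIP readers typically locate by scanning backwards from the end of the file. The function name inspectZip and its result fields are made up for illustration, the sketch assumes a plain (non-ZIP64) archive, and it says nothing about how Word's own repair logic works. It uses a 22-byte empty archive as a stand-in so it runs without any .docx file on disk.

```javascript
// Scan backwards for the EOCD record and report what it claims about the archive.
// Assumes a non-ZIP64 archive; returns null if no EOCD signature is found.
function inspectZip(buf) {
  const EOCD_SIGNATURE = 0x06054b50; // stored on disk as bytes 50 4b 05 06
  const EOCD_MIN_SIZE = 22;          // EOCD record with an empty comment
  for (let i = buf.length - EOCD_MIN_SIZE; i >= 0; i--) {
    if (buf.readUInt32LE(i) === EOCD_SIGNATURE) {
      const centralDirSize = buf.readUInt32LE(i + 12);
      const centralDirOffset = buf.readUInt32LE(i + 16);
      return {
        eocdOffset: i,
        totalEntries: buf.readUInt16LE(i + 10),
        centralDirSize,
        centralDirOffset,
        // In a well-formed archive the EOCD directly follows the central
        // directory, so any difference is data sitting in front of the archive.
        prependedBytes: i - (centralDirOffset + centralDirSize),
      };
    }
  }
  return null;
}

// Smallest valid ZIP: an EOCD record describing an archive with no entries.
const emptyZip = Buffer.alloc(22);
emptyZip.writeUInt32LE(0x06054b50, 0);

const doubled = Buffer.concat([emptyZip, emptyZip]);

console.log(inspectZip(emptyZip).prependedBytes); // 0
console.log(inspectZip(doubled).prependedBytes);  // 22
```

The last EOCD record still describes a single archive, and the first copy shows up only as 22 bytes of unexplained data in front of it. Run against the a.docx buffer from the question, one would expect inspectZip(Buffer.concat([buf, buf])) to report the same totalEntries as inspectZip(buf), with prependedBytes equal to buf.length. That is consistent with a repaired file showing only the original content, although this is an interpretation rather than a description of what Word actually does.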