The problem I have is that when I create a .docx document with an embedded (OLE) file of type .pdf, the generated binary file in the /embeddings folder is larger than the original document.
I inserted a document of 52076 bytes. If I rename the .docx to .zip and open it, oleObject1.bin is 55296 bytes.
Now when I try to extract the file with Apache POI, the file is there but corrupted.
Any ideas? (At first I thought it might be compressed.)
Thx
OK, I found the issue:
For docx, for example, the embedded .bin is an OLE2 container, so there are some data blocks before the actual file (RootEntry, ObjInfo, CONTENTS, ...). With a hex editor you will see that the file itself starts further in. I fixed my extractor by checking which entries the root directory contains; for PDF you have to read the CONTENTS directory entry:
private void writeBinaryPackagePart(PackagePart part, File targetfolder, String extension, String fileName) throws IOException {
    if (StringUtils.isEmpty(fileName)) {
        fileName = generateUniqueId(OleExtractorUtils.OfficeType.BINARY).concat(".").concat(extension);
    }
    // prepareToCheckMagic wraps the stream so the magic bytes can be peeked and then reset
    InputStream inputStream = FileMagic.prepareToCheckMagic(part.getInputStream());
    try {
        if (FileMagic.valueOf(inputStream) == FileMagic.OLE2) {
            try (NPOIFSFileSystem npoifsFileSystem = new NPOIFSFileSystem(inputStream)) {
                if (isOle10Native(npoifsFileSystem.getRoot())) {
                    // Packaged objects keep their payload in the Ole10Native stream
                    byte[] dataBuffer = Ole10Native.createFromEmbeddedOleObject(npoifsFileSystem.getRoot()).getDataBuffer();
                    writeOle10NativeObject(dataBuffer, fileName, targetfolder);
                } else if (npoifsFileSystem.getRoot().getEntryNames().contains("CONTENTS")) {
                    // Embedded PDFs keep the raw file bytes in the CONTENTS stream
                    try (DocumentInputStream contents = npoifsFileSystem.createDocumentInputStream("CONTENTS")) {
                        writeOle10NativeObject(IOUtils.toByteArray(contents), fileName, targetfolder);
                    }
                }
            }
        }
    } catch (Exception e) {
        LOGGER.warn("Cannot create Ole10Native from Object {}! Writing the following binary: {}", part.getPartName(), fileName, e);
        // The first stream may already be consumed or closed, so reopen the part for the raw fallback copy
        try (InputStream raw = part.getInputStream()) {
            ServiceUtil.moveUploadedFileToExistingTempFolder(raw, fileName, targetfolder);
        }
    } finally {
        // Close the stream on the success path too, not only after a failure
        inputStream.close();
    }
}

// Returns true if the directory contains an Ole10Native entry
private boolean isOle10Native(DirectoryNode directoryNode) {
    String ole10Native = Ole10Native.OLE10_NATIVE;
    Iterator<Entry> entries = directoryNode.getEntries();
    while (entries.hasNext()) {
        Entry entry = entries.next();
        if (entry.getName().contains(ole10Native)) {
            return true;
        }
    }
    return false;
}
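
If you want to verify the same thing without a hex editor, here is a minimal sketch that uses only the JDK. It checks whether a byte array starts with the OLE2 signature or with a bare PDF header, and whether a size is a whole number of sectors. The class name EmbeddedPayloadSniffer and its methods are hypothetical helpers of mine, not part of Apache POI, and the 512-byte sector size is an assumption (it is the usual size for these containers, but larger sectors exist).

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Apache POI: sniffs what an embedded part really is.
class EmbeddedPayloadSniffer {
    // Signature found at the start of an OLE2 compound file
    static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };
    // A bare PDF file starts with this ASCII marker
    static final byte[] PDF_MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);
    // Assumption: 512-byte sectors
    static final int SECTOR_SIZE = 512;

    // True if data begins with exactly the bytes in prefix
    static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    // Classifies the leading bytes as "OLE2", "PDF" or "UNKNOWN"
    static String sniff(byte[] data) {
        if (startsWith(data, OLE2_MAGIC)) {
            return "OLE2";
        }
        if (startsWith(data, PDF_MAGIC)) {
            return "PDF";
        }
        return "UNKNOWN";
    }

    // True if size is a positive whole number of sectors
    static boolean isSectorAligned(long size) {
        return size > 0 && size % SECTOR_SIZE == 0;
    }
}
```

With the numbers from the question, isSectorAligned(55296) is true (55296 = 108 x 512) while isSectorAligned(52076) is false, which fits a sector-based container wrapped around the PDF rather than compression. Calling sniff on the bytes of oleObject1.bin should report "OLE2" if the part is such a container, and that is the case where the PDF has to be read from the CONTENTS entry instead of from the start of the file.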