
Batch resampling images in MS Word files


I have a few thousand Word files which some of my colleagues have put together. They're not very technical people, and they've just taken their 10-megapixel cameras and embedded a few photos directly into the Word files without resampling them. Often the images are scaled down to be quite small on the page, say 3" by 2" approximately.

I need to write some sort of tool to go through these Word files one by one (each is around 300 MB), downsample the images, and then save the file.

We're dealing predominantly with .doc files, rather than .docx. There may be some PowerPoint files as well.

I have a few options available to me. I can write a program in C# which gives the user a nice interface allowing them to specify the DPI and JPEG quality when saving. Alternatively, I can use a VBA macro, although I will probably need to either write a DLL or use a third-party one for the image resizing.
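
To make the DPI setting concrete, here is a minimal sketch of the arithmetic, written in Python for brevity (the tool itself would be C# or VBA). The 3872 x 2592 source size and the 150 DPI target are assumptions chosen for illustration, and the function names are hypothetical.

```python
def target_pixels(width_in, height_in, dpi):
    """Pixel dimensions needed to show an image at a physical size and DPI."""
    return round(width_in * dpi), round(height_in * dpi)


def scale_factor(src_px, dst_px):
    """Uniform shrink factor that fits src inside dst; never upsamples."""
    return min(dst_px[0] / src_px[0], dst_px[1] / src_px[1], 1.0)


# Assumed 10-megapixel frame placed at 3" x 2" on the page.
source = (3872, 2592)
target = target_pixels(3, 2, 150)  # 150 DPI is an assumed user setting
factor = scale_factor(source, target)
print(target, round(factor, 3))
```

Under these assumptions the image only needs a small fraction of its original pixels, which is where the size savings would come from.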

I've done some Excel importing from .xls and .xlsx files into C# and it was a breeze. However, I suspect that writing downsampled images back to .doc files in such a way that the formatting looks unchanged may be tricky.

Can I get some input: Are there any free libraries (free for commercial use) for accessing .doc files which can do what I need? If I were to write it in VBA, aside from the downsampling problem, are there any other obstacles I would face? Lastly, do you have an alternative suggestion on how to tackle this?


Solution

  • Okay, I haven't had any answers or comments in about a week so I'm going to answer my own question with what I've managed to learn in that time. I hope it will be beneficial for some other person later down the line.

    As I mentioned, we are dealing with thousands of Office (Word and PowerPoint) files which have full-resolution digital camera images in them. The files can be anywhere up to several hundred MB, when they should be a few hundred KB to a few MB at most. This is a burden on the company network, and it also makes these crucial documents very slow to open.

    What I originally did was unpack the .doc files with 7-Zip. I ran the command-line interface in a hidden System.Diagnostics.Process to extract the "WordDocument" stream from the .doc file.
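
    A rough Python equivalent of that extraction step might look like the sketch below (the original used C# and System.Diagnostics.Process). It assumes the 7-Zip command-line executable is on the PATH under the name `7z`; the function names are hypothetical.

```python
import subprocess


def build_extract_command(doc_path, out_dir, stream="WordDocument", exe="7z"):
    """Build a 7-Zip command line that extracts one stream from a .doc file."""
    # "e" extracts without paths, "-o<dir>" sets the output folder,
    # "-y" answers yes to overwrite prompts.
    return [exe, "e", doc_path, f"-o{out_dir}", stream, "-y"]


def extract_stream(doc_path, out_dir):
    """Run 7-Zip without opening a console window (the flag is Windows-only)."""
    subprocess.run(
        build_extract_command(doc_path, out_dir),
        check=True,
        creationflags=getattr(subprocess, "CREATE_NO_WINDOW", 0),
    )
```

    Calling `extract_stream("report.doc", "out")` would leave the stream in the `out` folder if 7-Zip is installed; both names are placeholders.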

    Then I would read through WordDocument byte by byte until I found the JPEG SOI marker (0xFF 0xD8), and keep reading until the EOI marker (0xFF 0xD9). I would load that slice of WordDocument as a stream into an Image, resize it, and save it back to the WordDocument stream at a lower resolution and lower JPEG quality. I can confirm that the images were being read in correctly and that they were being inserted into WordDocument correctly. We ended up with files much, much smaller than we started with. Unfortunately, while 7-Zip allows you to extract these components from .doc files, it does not appear to let you re-insert them, so all of that work was basically for nothing. I may be wrong about this, but my version (the latest at the moment) will not let me add files to a .doc package.
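
    The marker scan described above can be sketched in a few lines of Python. This is a naive illustration and not the original C# code: it treats the first EOI after each SOI as the end of the image, so a JPEG with an embedded thumbnail (which carries its own SOI/EOI pair) or a chance 0xFF 0xD8 in non-image data would need extra handling. The function names are hypothetical, and the actual resizing would need an imaging library.

```python
SOI = b"\xff\xd8"  # JPEG start-of-image marker
EOI = b"\xff\xd9"  # JPEG end-of-image marker


def find_jpeg_spans(data):
    """Return (start, end) offsets of candidate JPEGs; end is exclusive."""
    spans = []
    pos = 0
    while True:
        start = data.find(SOI, pos)
        if start == -1:
            break
        end = data.find(EOI, start + 2)
        if end == -1:
            break  # unterminated image: stop scanning
        spans.append((start, end + 2))
        pos = end + 2
    return spans


def replace_span(data, span, new_bytes):
    """Return a copy of data with the bytes in span swapped for new_bytes."""
    start, end = span
    return data[:start] + new_bytes + data[end:]
```

    When replacing several images, working from the last span to the first keeps the earlier offsets valid as the buffer shrinks.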

    Next, I rewrote the function so that it uses the MS Office interop library. I open a Word.Application and a Word.Document, run Document.Convert() and then save it as a .docx file. A lot of the time this is sufficient, but sometimes we end up with a file that is only slightly smaller. Upon inspection of the ZIP contents of the .docx files, it seems that the creator of the document used Microsoft Photo Editor 3, which somehow added a few dozen MB worth of OLE information to the .docx.
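
    Because a .docx is an ordinary ZIP package, that kind of inspection is easy to script. The Python sketch below lists the largest parts in a package; the part names in the sample are illustrative only, and `largest_parts` is a hypothetical helper.

```python
import io
import zipfile


def largest_parts(package, limit=10):
    """Return (name, uncompressed_size) for the biggest parts in a package."""
    with zipfile.ZipFile(package) as zf:
        parts = [(info.filename, info.file_size) for info in zf.infolist()]
    return sorted(parts, key=lambda item: item[1], reverse=True)[:limit]


# Build a tiny in-memory stand-in for a .docx so the sketch runs anywhere.
sample = io.BytesIO()
with zipfile.ZipFile(sample, "w") as zf:
    zf.writestr("word/document.xml", b"<w/>")
    zf.writestr("word/media/image1.jpeg", b"x" * 100)
    zf.writestr("word/embeddings/oleObject1.bin", b"y" * 500)

for name, size in largest_parts(sample):
    print(f"{size:>8}  {name}")
```

    Passing a real file path instead of the in-memory sample works the same way, and large entries outside the media folder would point to embedded OLE data like that described above.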

    So that is where I'm up to. I have outlined the two methods I have tried. The first is a raw .doc editing technique which will only work if you can find a way to repackage WordDocument into the .doc. I haven't tested it with PowerPoint files, but I assume the process would be similar. The second method has the advantage of producing .docx and .pptx files, which can be opened with a zip-compatible packaging library so that the resources can be edited or deleted quite easily. Unfortunately, it means that Office needs to be installed on the machine, and if you don't have a relatively new version of Office then the Document.Convert() method will throw an exception.
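
    For the second method, editing or deleting resources in the converted package could look roughly like this Python sketch. `rewrite_package` and the `transform` callback are hypothetical names; the real callback would resize each image with an imaging library. Dropping a part without also removing the relationships that point to it may produce a file that Office reports as damaged, so replacing an image under the same name is the safer case.

```python
import io
import zipfile


def rewrite_package(src, dst, transform):
    """Copy every part from src to dst through transform(name, data).

    transform returns the bytes to store, or None to drop the part.
    """
    with zipfile.ZipFile(src) as zin:
        with zipfile.ZipFile(dst, "w", zipfile.ZIP_DEFLATED) as zout:
            for info in zin.infolist():
                data = transform(info.filename, zin.read(info.filename))
                if data is not None:
                    zout.writestr(info.filename, data)


def placeholder_shrink(name, data):
    """Stand-in for real resizing: truncates media so the effect is visible."""
    return data[:10] if name.startswith("word/media/") else data


# In-memory demo package; a real run would pass file paths instead.
source = io.BytesIO()
with zipfile.ZipFile(source, "w") as zf:
    zf.writestr("word/document.xml", b"<w/>")
    zf.writestr("word/media/image1.jpeg", b"x" * 100)

result = io.BytesIO()
rewrite_package(source, result, placeholder_shrink)
```

    Writing to a new package rather than modifying the original in place also leaves the source file untouched if something goes wrong partway through.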

    I hope that helps anyone reading this.