I'm trying to find and replace text passages in docx files with POI 3.8, as described here.
That works just fine as long as I insert my tags in a single editing pass. But as soon as I re-open the docx file and make some modifications, Word fragments the text across several runs. So for example, "Hello world" might become:
<w:r><w:t>Hello wo</w:t></w:r><w:r w:rsidR="00FB0672"><w:t>rld</w:t></w:r>
I think this fragmentation is caused by things like change tracking, formatting, and spell checking.
Does anybody have an idea how to ...
a) ... disable this feature in MS Word?
b) ... somehow de-fragment the docx file afterwards?
c) ... get rid of this fragmentation in some other way?
I already tried saving the file as .doc/.odt and then re-saving it as .docx, but the fragmentation persists.
Any help is highly appreciated. Thanks in advance!
In Word, the features you can (and will want to) turn off are spelling and grammar checking, and rsid insertion.
This is for docx4j (a project I manage), not POI, but its VariablePrepare class shows what needs to be done to de-fragment the runs if you can't prevent the fragmentation in Word. Since POI uses a similar XML marshalling/unmarshalling approach (albeit XMLBeans, not JAXB), you should be able to convert that code to use the POI API.
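To illustrate the general idea (this is not the VariablePrepare code, just a minimal sketch), here is one way to replace a string that has been split across runs. It works on a plain list of run texts so that it stays self-contained. The class name RunTextReplacer and the method replaceAcrossRuns are hypothetical names made up for this example. With POI you would presumably fill the list from XWPFRun.getText(0) for each run of a paragraph and write the results back with XWPFRun.setText(text, 0), but I have not tested that wiring.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: models each run of a paragraph as a plain string.
final class RunTextReplacer {

    // Replaces every occurrence of target in the concatenated run texts,
    // even when a match is split across several runs. The replacement is
    // placed in the run where the match starts; the matched characters are
    // removed from the following runs. The number of runs is unchanged, so
    // each run keeps its own formatting.
    static List<String> replaceAcrossRuns(List<String> runTexts, String target, String replacement) {
        List<String> runs = new ArrayList<>(runTexts);
        if (target.isEmpty()) {
            return runs;
        }
        int searchFrom = 0;
        while (true) {
            String joined = String.join("", runs);
            int start = joined.indexOf(target, searchFrom);
            if (start < 0) {
                return runs;
            }
            int end = start + target.length();
            int offset = 0;
            boolean replaced = false;
            for (int i = 0; i < runs.size(); i++) {
                String text = runs.get(i);
                int runStart = offset;
                int runEnd = offset + text.length();
                offset = runEnd;
                if (runEnd <= start || runStart >= end) {
                    continue; // this run does not overlap the match
                }
                // Keep the text before and after the matched part of this run.
                String before = text.substring(0, Math.max(start - runStart, 0));
                String after = text.substring(Math.min(end, runEnd) - runStart);
                if (!replaced) {
                    runs.set(i, before + replacement + after);
                    replaced = true;
                } else {
                    runs.set(i, before + after);
                }
            }
            // Continue after the inserted text so a replacement that itself
            // contains the target does not loop forever.
            searchFrom = start + replacement.length();
        }
    }
}
```

For the fragmented example from the question, replaceAcrossRuns(Arrays.asList("Hello wo", "rld"), "world", "there") gives the two run texts "Hello there" and "" (an empty second run). Keeping the run count fixed avoids having to decide which formatting wins when runs are merged. The trade-off is that empty runs stay in the document.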