.net openxml-sdk docx mergefield

Unusual XML notations of mergefields in DOCX file.


In our document generation system we use DOCX files in which we programmatically fill out mergefields. For this I'm using the OpenXml SDK 2.0.

I've been plowing through the document.xml file inside the DOCX, and found that the mergefields are usually represented by a SimpleField. An example from a document we use:

<w:fldSimple w:instr=" MERGEFIELD  NP021_INSSNumber  \* MERGEFORMAT "><w:r><w:rPr><w:noProof/></w:rPr><w:t>«NP021_INSSNumber»</w:t></w:r></w:fldSimple>

A fairly straightforward notation, containing the mergefield command and the text to be displayed in the document. It's fairly easy to find this tag in the XML: just search for w:fldSimple tags. (I removed some style tags to make it more readable.)

But a document recently created in Word didn't parse in our code, and when I looked in the XML the notation for mergefields was completely different:

<w:instrText xml:space="preserve"> MERGEFIELD  NP021_INSSNumber  \* MERGEFORMAT </w:instrText>

And later in the document I found the display notation, in a separate element: <w:t>«NP021_INSSNumber»</w:t>. With the command and the display text split apart like this, it is very hard to parse in code.

How is it possible that doing the same thing in Word can have such different results, and is there a way to ensure that Word uses SimpleFields as XML notation for mergefields?

Thank you in advance for any helpful input.


Solution

  • I would consider accepting revisions and simplifying the markup prior to parsing it.

    Note that the MarkupSimplifier is included in the Power Tools for Open XML.

    You will probably find lots more useful material in Eric White's blog posts.
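
For reference, the second notation in the question is a complex field: the w:instrText runs sit between a w:fldChar with w:fldCharType="begin" and one with "separate", and the displayed text follows until the "end" marker. Word may also split the instruction across several w:instrText runs. Below is a minimal sketch of how both notations could be read in one pass. It uses Python's standard library rather than the OpenXml SDK, purely to illustrate the logic; the names find_merge_fields, field_name and SAMPLE are hypothetical, and it assumes fields are not nested.

```python
import re
import xml.etree.ElementTree as ET

# WordprocessingML main namespace (the "w:" prefix in document.xml).
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Matches ' MERGEFIELD name ...' or ' MERGEFIELD "name with spaces" ...'.
_MERGEFIELD = re.compile(r'^\s*MERGEFIELD\s+(?:"([^"]+)"|(\S+))', re.IGNORECASE)


def q(tag):
    """Qualify a local tag or attribute name with the w: namespace."""
    return "{%s}%s" % (W, tag)


def field_name(instr):
    """Return the merge field name from a field instruction, or None."""
    m = _MERGEFIELD.match(instr)
    return (m.group(1) or m.group(2)) if m else None


def find_merge_fields(xml_text):
    """List merge field names in document order, for both notations.

    Assumption: fields are not nested inside one another.
    """
    names = []
    instr_parts = None  # a list while inside a complex field's instruction
    for el in ET.fromstring(xml_text).iter():
        if el.tag == q("fldSimple"):
            # Simple notation: the instruction is an attribute.
            name = field_name(el.get(q("instr"), ""))
            if name:
                names.append(name)
        elif el.tag == q("fldChar"):
            kind = el.get(q("fldCharType"))
            if kind == "begin":
                instr_parts = []
            elif kind in ("separate", "end") and instr_parts is not None:
                # Instruction is complete; join the pieces Word may have split.
                name = field_name("".join(instr_parts))
                if name:
                    names.append(name)
                instr_parts = None
        elif el.tag == q("instrText") and instr_parts is not None:
            instr_parts.append(el.text or "")
    return names


# Hypothetical sample: one simple field, one complex field split over two runs.
SAMPLE = (
    '<w:body xmlns:w="%s">' % W
    + '<w:p><w:fldSimple w:instr=" MERGEFIELD  NP021_INSSNumber  \\* MERGEFORMAT ">'
    '<w:r><w:t>«NP021_INSSNumber»</w:t></w:r></w:fldSimple></w:p>'
    '<w:p><w:r><w:fldChar w:fldCharType="begin"/></w:r>'
    '<w:r><w:instrText xml:space="preserve"> MERGEFIELD  NP021_</w:instrText></w:r>'
    '<w:r><w:instrText xml:space="preserve">Name  \\* MERGEFORMAT </w:instrText></w:r>'
    '<w:r><w:fldChar w:fldCharType="separate"/></w:r>'
    '<w:r><w:t>«NP021_Name»</w:t></w:r>'
    '<w:r><w:fldChar w:fldCharType="end"/></w:r></w:p>'
    '</w:body>'
)

if __name__ == "__main__":
    print(find_merge_fields(SAMPLE))
```

The same walk should translate to the OpenXml SDK by iterating the body's descendants and checking for the corresponding simple-field, field-char and field-code elements. Simplifying the markup first, as suggested above, reduces the number of split runs the code has to cope with.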