Search code examples
c#.netms-wordopenxml

OpenXML tag search


I'm writing a .NET application that should read a .docx file nearby 200 pages long (trough DocumentFormat.OpenXML 2.5) to find all the occurences of certain tags that the document should contain. To be clear I'm not looking for OpenXML tags but rather tags that should be set into the document by the document writer as placeholder for values I need to fill up in a second stage. Such tags should be in the following format:

 <!TAG!>

(where TAG can be an arbitrary sequence of characters). As I said I have to find all the occurences of such tags plus (if possibile) locating the 'page' where the tag occurence have been found. I found something looking around in the web but more than once the base approach was to dump all the content of the file in a string and then look inside such string regardless of the .docx encoding. This either caused false positive or no match at all (while the test .docx file contains several tags), other examples were probably a little over my knowledge of OpenXML. The regex pattern to find such tags should be something of this kind:

<!(.)*?!>

The tag can be found all over the document (inside table, text, paragraph, as also header and footer).

I'm coding in Visual Studio 2013 .NET 4.5 but I can get back if needed. P.S. I would prefer code without use of Office Interop API since the destination platform will not run Office.

The smallest .docx example I can produce store this inside document

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<w:document xmlns:wpc="http://schemas.microsoft.com/office/word/2010/wordprocessingCanvas" xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:m="http://schemas.openxmlformats.org/officeDocument/2006/math" xmlns:v="urn:schemas-microsoft-com:vml" xmlns:wp14="http://schemas.microsoft.com/office/word/2010/wordprocessingDrawing" xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing" xmlns:w10="urn:schemas-microsoft-com:office:word" xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main" xmlns:w14="http://schemas.microsoft.com/office/word/2010/wordml" xmlns:w15="http://schemas.microsoft.com/office/word/2012/wordml" xmlns:wpg="http://schemas.microsoft.com/office/word/2010/wordprocessingGroup" xmlns:wpi="http://schemas.microsoft.com/office/word/2010/wordprocessingInk" xmlns:wne="http://schemas.microsoft.com/office/word/2006/wordml" xmlns:wps="http://schemas.microsoft.com/office/word/2010/wordprocessingShape" mc:Ignorable="w14 w15 wp14">
<w:body>
<w:p w:rsidR="00CA7780" w:rsidRDefault="00815E5D">
  <w:pPr>
    <w:rPr>
      <w:lang w:val="en-GB"/>
    </w:rPr>
  </w:pPr>
  <w:r>
    <w:rPr>
      <w:lang w:val="en-GB"/>
    </w:rPr>
    <w:t>TRY</w:t>
  </w:r>
</w:p>
<w:p w:rsidR="00815E5D" w:rsidRDefault="00815E5D">
  <w:pPr>
    <w:rPr>
      <w:lang w:val="en-GB"/>
    </w:rPr>
  </w:pPr>
  <w:proofErr w:type="gramStart"/>
  <w:r>
    <w:rPr>
      <w:lang w:val="en-GB"/>
    </w:rPr>
    <w:t>&lt;!TAG1</w:t>
  </w:r>
  <w:proofErr w:type="gramEnd"/>
  <w:r>
    <w:rPr>
      <w:lang w:val="en-GB"/>
    </w:rPr>
    <w:t>!&gt;</w:t>
  </w:r>
</w:p>
<w:p w:rsidR="00815E5D" w:rsidRPr="00815E5D" w:rsidRDefault="00815E5D">
  <w:pPr>
    <w:rPr>
      <w:lang w:val="en-GB"/>
    </w:rPr>
  </w:pPr>
  <w:r>
    <w:rPr>
      <w:lang w:val="en-GB"/>
    </w:rPr>
    <w:t>TRY2</w:t>
  </w:r>
  <w:bookmarkStart w:id="0" w:name="_GoBack"/>
  <w:bookmarkEnd w:id="0"/>
</w:p>
<w:sectPr w:rsidR="00815E5D" w:rsidRPr="00815E5D">
  <w:pgSz w:w="11906" w:h="16838"/>
  <w:pgMar w:top="1417" w:right="1134" w:bottom="1134" w:left="1134" w:header="708" w:footer="708" w:gutter="0"/>
  <w:cols w:space="708"/>
  <w:docGrid w:linePitch="360"/>
</w:sectPr>
</w:body>
</w:document>

Best Regards, Mike


Solution

  • The problem with trying to find tags is that words are not always in the underlying XML in the format that they appear to be in Word. For example, in your sample XML the <!TAG1!> tag is split across multiple runs like this:

    <w:r>
        <w:rPr>
            <w:lang w:val="en-GB"/>
        </w:rPr>
        <w:t>&lt;!TAG1</w:t>
    </w:r>
    <w:proofErr w:type="gramEnd"/>
        <w:r>
        <w:rPr>
            <w:lang w:val="en-GB"/>
        </w:rPr>
        <w:t>!&gt;</w:t>
    </w:r>
    

    As pointed out in the comments this is sometimes caused by the spelling and grammar checker but that's not all that can cause it. Having different styles on parts of the tag could also cause it for example.

    One way of handling this is to find the InnerText of a Paragraph and compare that against your Regex. The InnerText property will return the plain text of the paragraph without any formatting or other XML within the underlying document getting in the way.

    Once you have your tags, replacing the text is the next problem. Due to the above reasons you can't just replace the InnerText with some new text as it wouldn't be clear as to which parts of the text would belong in which Run. The easiest way round this is to remove any existing Run's and add a new Run with a Text property containing the new text.

    The following code shows finding the tags and replacing them immediately rather than using two passes as you suggest in your question. This was just to make the example simpler to be honest. It should show everything you need.

    private static void ReplaceTags(string filename)
    {
        Regex regex = new Regex("<!(.)*?!>", RegexOptions.Compiled);
    
        using (WordprocessingDocument wordDocument = WordprocessingDocument.Open(filename, true))
        {
            //grab the header parts and replace tags there
            foreach (HeaderPart headerPart in wordDocument.MainDocumentPart.HeaderParts)
            {
                ReplaceParagraphParts(headerPart.Header, regex);
            }
            //now do the document
            ReplaceParagraphParts(wordDocument.MainDocumentPart.Document, regex);
            //now replace the footer parts
            foreach (FooterPart footerPart in wordDocument.MainDocumentPart.FooterParts)
            {
                ReplaceParagraphParts(footerPart.Footer, regex);
            }
        }
    }
    
    private static void ReplaceParagraphParts(OpenXmlElement element, Regex regex)
    {
        foreach (var paragraph in element.Descendants<Paragraph>())
        {
            Match match = regex.Match(paragraph.InnerText);
            if (match.Success)
            {
                //create a new run and set its value to the correct text
                //this must be done before the child runs are removed otherwise
                //paragraph.InnerText will be empty
                Run newRun = new Run();
                newRun.AppendChild(new Text(paragraph.InnerText.Replace(match.Value, "some new value")));
                //remove any child runs
                paragraph.RemoveAllChildren<Run>();
                //add the newly created run
                paragraph.AppendChild(newRun);
            }
        }
    }
    

    One downside with the above approach is that any styles you may have had will be lost. These could be copied from the existing Run's but if there are multiple Run's with differing properties you'll need to work out which ones you need to copy where. There's nothing to stop you creating multiple Run's in the above code each with different properties if that's what is required. Other elements would also be lost (e.g. any symbols) so those would need to be accounted for too.