
DOCX XML Does Not Represent Line Breaks Like Word Does?


I am using the Open XML SDK 2.5 to read .docx files in my console application.

There appears to be a discrepancy between how Word displays the document and how the document is represented in XML when opened with the Open XML SDK.

Here is my example as seen in Word with whitespace visible:


[Screenshot: the example paragraph in Word with whitespace visible]


So in my application I have a reference to this paragraph as a DocumentFormat.OpenXml.Wordprocessing.Paragraph object. After browsing the Open XML documentation it became clear to me that there is no representation of a "line" in the XML format. So the best I can do is have my Paragraph and the closest approximation to a line is the Run object. The Paragraph node has a collection of 6 Run objects in this example. If I get the InnerXml property of the Paragraph in this example here is how it looks:

<w:pPr xmlns:w=\"http://schemas.openxmlformats.org/wordprocessingml/2006/main\"><w:pStyle w:val=\"PlainText\" /><w:numPr><w:ilvl w:val=\"0\" /><w:numId w:val=\"17\" /></w:numPr><w:rPr><w:rFonts w:ascii=\"Arial\" w:hAnsi=\"Arial\" /><w:b /></w:rPr></w:pPr><w:r w:rsidRPr=\"000558F8\" xmlns:w=\"http://schemas.openxmlformats.org/wordprocessingml/2006/main\"><w:rPr><w:rFonts w:ascii=\"Arial\" w:hAnsi=\"Arial\" /></w:rPr><w:t>Should we use the term “Verify” instead of “Confirm”</w:t></w:r><w:r w:rsidRPr=\"000558F8\" w:rsidR=\"00F5335C\" xmlns:w=\"http://schemas.openxmlformats.org/wordprocessingml/2006/main\"><w:rPr><w:rFonts w:ascii=\"Arial\" w:hAnsi=\"Arial\" /></w:rPr><w:t xml:space=\"preserve\"> as per work instruction</w:t></w:r><w:r w:rsidRPr=\"000558F8\" w:rsidR=\"00411638\" xmlns:w=\"http://schemas.openxmlformats.org/wordprocessingml/2006/main\"><w:rPr><w:rFonts w:ascii=\"Arial\" w:hAnsi=\"Arial\" /></w:rPr><w:t>?</w:t></w:r><w:r w:rsidR=\"000558F8\" xmlns:w=\"http://schemas.openxmlformats.org/wordprocessingml/2006/main\"><w:rPr><w:rFonts w:ascii=\"Arial\" w:hAnsi=\"Arial\" /></w:rPr><w:br /><w:t>Med</w:t></w:r><w:r w:rsidRPr=\"000558F8\" w:rsidR=\"003E76BD\" xmlns:w=\"http://schemas.openxmlformats.org/wordprocessingml/2006/main\"><w:rPr><w:rFonts w:ascii=\"Arial\" w:hAnsi=\"Arial\" /><w:b /></w:rPr><w:br /><w:t xml:space=\"preserve\">JD: </w:t></w:r><w:r w:rsidRPr=\"000558F8\" w:rsidR=\"00A118AB\" xmlns:w=\"http://schemas.openxmlformats.org/wordprocessingml/2006/main\"><w:rPr><w:rFonts w:ascii=\"Arial\" w:hAnsi=\"Arial\" /><w:b /></w:rPr><w:t>Done.</w:t></w:r>

All I see are the paragraph properties node and the 6 run nodes. And as you can see the run nodes don't equate to lines. Looking at my example from within Word I see that the paragraph has 2 carriage returns and I would expect this to be represented by 3 "lines". However in XML I get 6 runs which seem to be a close approximation to the 3 lines but for some reason some lines are split up seemingly arbitrarily.

The REAL issue is that I don't see any way of interpreting the run nodes in a way that I could reconstruct the line structure I have in the example in Word. For instance, nothing indicates to me that runs 1, 2, and 3 together make up line 1.

I need to parse over 300 Word documents that depend on the line breaks for formatting. I NEED the line breaks. How can I get them? Is this possible with the Open XML SDK?

Thanks in advance.


Solution

  • The element you are looking for in your XML is the Break element, <w:br />.

    From the documentation, this XML:

    <w:r>
        <w:t>This is</w:t>
        <w:br/>
        <w:t xml:space="preserve"> a simple sentence.</w:t>
    </w:r>
    

    Would produce

    This is
    a simple sentence.

    I've prettified your XML and marked the Breaks at the end of this answer.

    Runs are not used to determine lines, rather they are a logical block to contain text with the same properties. For example, imagine I had the following text:

    Testing

    where the "ing" is in bold. In Open XML this requires two runs, one for "Test" and one for "ing", because they have different properties. The XML would be something like this:

    <w:r>
        <w:t>Test</w:t>
    </w:r>
    <w:r w:rsidRPr="004750BC">
        <w:rPr>
           <w:b />
        </w:rPr>
        <w:t>ing</w:t>
    </w:r>
    

    The <w:rPr> element holds the run properties, with <w:b /> denoting bold.

    Your XML with the breaks highlighted:

    <w:pPr
        xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
        <w:pStyle w:val="PlainText" />
        <w:numPr>
            <w:ilvl w:val="0" />
            <w:numId w:val="17" />
        </w:numPr>
        <w:rPr>
            <w:rFonts w:ascii="Arial" w:hAnsi="Arial" />
            <w:b />
        </w:rPr>
    </w:pPr>
    <w:r w:rsidRPr="000558F8"
        xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
        <w:rPr>
            <w:rFonts w:ascii="Arial" w:hAnsi="Arial" />
        </w:rPr>
        <w:t>Should we use the term “Verify” instead of “Confirm”</w:t>
    </w:r>
    <w:r w:rsidRPr="000558F8" w:rsidR="00F5335C"
        xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
        <w:rPr>
            <w:rFonts w:ascii="Arial" w:hAnsi="Arial" />
        </w:rPr>
        <w:t xml:space="preserve"> as per work instruction</w:t>
    </w:r>
    <w:r w:rsidRPr="000558F8" w:rsidR="00411638"
        xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
        <w:rPr>
            <w:rFonts w:ascii="Arial" w:hAnsi="Arial" />
        </w:rPr>
        <w:t>?</w:t>
    </w:r>
    <w:r w:rsidR="000558F8"
        xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
        <w:rPr>
            <w:rFonts w:ascii="Arial" w:hAnsi="Arial" />
        </w:rPr>
        <w:br /> <!-- break here -->
        <w:t>Med</w:t>
    </w:r>
    <w:r w:rsidRPr="000558F8" w:rsidR="003E76BD"
        xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
        <w:rPr>
            <w:rFonts w:ascii="Arial" w:hAnsi="Arial" />
            <w:b />
        </w:rPr>
        <w:br />  <!-- break here -->
        <w:t xml:space="preserve">JD: </w:t>
    </w:r>
    <w:r w:rsidRPr="000558F8" w:rsidR="00A118AB"
        xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
        <w:rPr>
            <w:rFonts w:ascii="Arial" w:hAnsi="Arial" />
            <w:b />
        </w:rPr>
        <w:t>Done.</w:t>
    </w:r>
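
To rebuild the lines, walk the runs in document order, append each <w:t> to the current line, and start a new line at every <w:br />. The sketch below shows that logic in plain Python with the standard library rather than the Open XML SDK. The function name paragraph_lines and the abbreviated, <w:p>-wrapped copy of your paragraph are my own assumptions for illustration. It ignores tabs and other run content, and it treats only text-wrapping breaks (no w:type, or w:type="textWrapping") as line breaks.

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace, as declared in the question's XML
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Abbreviated copy of the paragraph from the question, wrapped in <w:p>
# so that it parses as a single well-formed element (an assumption).
PARAGRAPH_XML = f"""
<w:p xmlns:w="{W}">
  <w:pPr><w:pStyle w:val="PlainText"/></w:pPr>
  <w:r><w:t>Should we use the term “Verify” instead of “Confirm”</w:t></w:r>
  <w:r><w:t xml:space="preserve"> as per work instruction</w:t></w:r>
  <w:r><w:t>?</w:t></w:r>
  <w:r><w:br/><w:t>Med</w:t></w:r>
  <w:r><w:rPr><w:b/></w:rPr><w:br/><w:t xml:space="preserve">JD: </w:t></w:r>
  <w:r><w:rPr><w:b/></w:rPr><w:t>Done.</w:t></w:r>
</w:p>
"""


def paragraph_lines(paragraph):
    """Split one <w:p> element into lines at each text-wrapping <w:br/>."""
    lines = [""]
    # iter() visits the runs in document order, whatever their formatting
    for run in paragraph.iter(f"{{{W}}}r"):
        for child in run:
            if child.tag == f"{{{W}}}t":
                # Text continues the current line, even across run boundaries
                lines[-1] += child.text or ""
            elif child.tag == f"{{{W}}}br":
                # Page and column breaks carry a w:type; skip those
                if child.get(f"{{{W}}}type") in (None, "textWrapping"):
                    lines.append("")
    return lines


lines = paragraph_lines(ET.fromstring(PARAGRAPH_XML))
for line in lines:
    print(line)
```

For your paragraph this gives three lines: runs 1 to 3 join into the first, "Med" is the second, and "JD: " plus "Done." form the third. With the SDK, the same walk would presumably go over each Run's children and check for Break and Text elements. I have not tested that mapping here.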