
Extract text from Word-Document using XSLT


I need to extract paragraphs (meaning headlines together with their content) from a Word document using XSLT. I have analyzed the structure and can reach the necessary nodes in the .docx file with XSLT. But now I do not know how to group the content of the w:t elements between the headings, because Word splits the text across several w:r/w:t elements in an unpredictable way.

The input data looks like this:

<w:body xmlns:w="somenamespace">
   <w:p>
      <w:pPr> <w:pStyle w:val="Heading1" /> </w:pPr>
      <w:r> <w:t>My Headl</w:t> </w:r>
      <w:r> <w:t>ine</w:t> </w:r>
   </w:p>
   <w:p>
      <w:r> <w:t>text 1.1.1 </w:t> </w:r>
      <w:r> <w:t>text 1.1.2 </w:t> </w:r>
   </w:p>
   <w:p>
      <w:r> <w:t>text 1.2.1 </w:t> </w:r>
      <w:r> <w:t>text 1.2.2 </w:t> </w:r>
   </w:p>
   <w:p>
      <w:pPr> <w:pStyle w:val="Heading1" /> </w:pPr>
      <w:r> <w:t>My seco</w:t> </w:r>
      <w:r> <w:t>nd Headline</w:t> </w:r>
   </w:p>
   <w:p>
      <w:r> <w:t>text 2.1.1 </w:t> </w:r>
      <w:r> <w:t>text 2.1.2 </w:t> </w:r>
   </w:p>
   <w:p>
      <w:r> <w:t>text 2.2.1 </w:t> </w:r>
      <w:r> <w:t>text 2.2.2 </w:t> </w:r>
   </w:p>
</w:body>

Concatenating the content of a single w:p element is no problem, so it is simple to merge the data into a compact structure like the following:

<Document>
    <Paragraphs>
        <Headline>My Headline</Headline>
        <Content>text 1.1.1 text 1.1.2 </Content>
        <Content>text 1.2.1 text 1.2.2 </Content>
        <Headline>My second Headline</Headline>
        <Content>text 2.1.1 text 2.1.2 </Content>
        <Content>text 2.2.1 text 2.2.2 </Content>
    </Paragraphs>
</Document>
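
A minimal sketch of how that flat structure could be produced, assuming an XSLT 1.0 processor and the placeholder namespace `somenamespace` from the sample input (a real .docx declares the WordprocessingML namespace instead). The template layout is only one possible way to write it:

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:w="somenamespace"
    exclude-result-prefixes="w">
  <xsl:output method="xml" indent="yes" omit-xml-declaration="yes" />

  <xsl:template match="w:body">
    <Document>
      <Paragraphs>
        <!-- Visit every paragraph in document order -->
        <xsl:apply-templates select="w:p" />
      </Paragraphs>
    </Document>
  </xsl:template>

  <!-- Paragraphs styled as Heading1 become a Headline;
       the predicate gives this rule priority over the plain w:p rule -->
  <xsl:template match="w:p[w:pPr/w:pStyle/@w:val='Heading1']">
    <Headline>
      <xsl:apply-templates select="w:r/w:t" />
    </Headline>
  </xsl:template>

  <!-- Every other paragraph becomes a Content element;
       the built-in rules copy the text of each w:t -->
  <xsl:template match="w:p">
    <Content>
      <xsl:apply-templates select="w:r/w:t" />
    </Content>
  </xsl:template>
</xsl:stylesheet>
```

Because only `w:r/w:t` is selected, the whitespace between the runs in the source is not copied, and the split runs are joined back into one string per paragraph.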

But this structure is not always useful, because it still does not have a single XML element for the whole content below one headline. Does anyone know how to merge all w:p elements that lie between two w:p elements representing a headline? I would like an XSLT stylesheet that transforms the w:body content into a structure like:

<Document>
    <Paragraph>
        <Headline>My Headline</Headline>
        <Content>text 1.1.1 text 1.1.2 text 1.2.1 text 1.2.2 </Content>
    </Paragraph>
    <Paragraph>
        <Headline>My second Headline</Headline>
        <Content>text 2.1.1 text 2.1.2 text 2.2.1 text 2.2.2 </Content>
    </Paragraph>
</Document>

What I have found so far:

  • If a w:p element contains a w:pPr element, it is always the first child node of that w:p element.

  • If a w:p element matches the condition ./w:pPr/w:pStyle[@w:val='Heading1'], then all w:r elements in it belong to the headline of the paragraph.


Solution

  • This might solve your problem. Use the xsl:for-each-group instruction, which is available in XSLT 2.0 and later. Select all w:p elements and declare, via group-starting-with, that each group begins at a w:p in which the heading style is defined. Inside the loop, the current-group() function returns the whole sequence of nodes in the group: the first item is the heading paragraph, the remaining items are its content paragraphs.

    XSLT:

    <xsl:stylesheet version="3.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:w="somenamespace">
      <xsl:output method="xml" omit-xml-declaration="yes" />

      <xsl:template match="w:body">
        <Document>
          <!-- A new group starts at every w:p that carries the Heading1 style -->
          <xsl:for-each-group select="w:p" group-starting-with="w:p[w:pPr/w:pStyle/@w:val='Heading1']">
            <Paragraph>
              <!-- The first item of the group is the heading paragraph.
                   Adjacent text() nodes are merged, so the runs join without a space. -->
              <Headline>
                <xsl:value-of select="current-group()[1]/*/w:t/text()" />
              </Headline>
              <!-- All remaining items are the content paragraphs below this heading -->
              <Content>
                <xsl:copy-of select="current-group()[position() > 1]/*/w:t/text()" />
              </Content>
            </Paragraph>
          </xsl:for-each-group>
        </Document>
      </xsl:template>

      <!-- Recurse through everything else without copying stray text to the output -->
      <xsl:template match="*|node()">
        <xsl:apply-templates />
      </xsl:template>
    </xsl:stylesheet>
    

    Output:

    <Document xmlns:w="somenamespace">
      <Paragraph>
        <Headline>My Headline</Headline>
        <Content>text 1.1.1 text 1.1.2 text 1.2.1 text 1.2.2 </Content>
      </Paragraph>
      <Paragraph>
        <Headline>My second Headline</Headline>
        <Content>text 2.1.1 text 2.1.2 text 2.2.1 text 2.2.2 </Content>
      </Paragraph>
    </Document>
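
If only an XSLT 1.0 processor is available, xsl:for-each-group cannot be used. One possible alternative, sketched here under the same assumptions as the sample input (placeholder namespace `somenamespace`, headings marked with Heading1, all w:p elements being siblings), is to index each non-heading paragraph by the generated id of its nearest preceding heading. The key name `by-heading` is made up for this illustration:

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:w="somenamespace"
    exclude-result-prefixes="w">
  <xsl:output method="xml" indent="yes" omit-xml-declaration="yes" />

  <!-- Index every non-heading paragraph by the id of the nearest
       preceding sibling paragraph that carries the Heading1 style -->
  <xsl:key name="by-heading"
           match="w:p[not(w:pPr/w:pStyle/@w:val='Heading1')]"
           use="generate-id(preceding-sibling::w:p[w:pPr/w:pStyle/@w:val='Heading1'][1])" />

  <xsl:template match="w:body">
    <Document>
      <!-- Only heading paragraphs are processed directly -->
      <xsl:apply-templates select="w:p[w:pPr/w:pStyle/@w:val='Heading1']" />
    </Document>
  </xsl:template>

  <xsl:template match="w:p[w:pPr/w:pStyle/@w:val='Heading1']">
    <Paragraph>
      <Headline>
        <!-- Join the runs of the heading itself -->
        <xsl:for-each select="w:r/w:t">
          <xsl:value-of select="." />
        </xsl:for-each>
      </Headline>
      <Content>
        <!-- Look up the paragraphs indexed under this heading and join their runs -->
        <xsl:for-each select="key('by-heading', generate-id())/w:r/w:t">
          <xsl:value-of select="." />
        </xsl:for-each>
      </Content>
    </Paragraph>
  </xsl:template>
</xsl:stylesheet>
```

Applied to the sample input, this should yield the same Document/Paragraph structure as the output above, without the xmlns:w declaration because of exclude-result-prefixes. Paragraphs that appear before the first heading are indexed under an empty string and are therefore dropped; whether that is acceptable depends on the documents being processed.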