I need to extract paragraphs (that is, headlines together with their content) from a Word document using XSLT. I have analyzed the structure and can reach the necessary nodes in the .docx file with XSLT, but I do not know how to group the content of the w:t elements between the headings, because Word splits the text into runs in a seemingly arbitrary way.
The input data looks like this:
<w:body xmlns:w="somenamespace">
<w:p>
<w:pPr> <w:pStyle w:val="Heading1" /> </w:pPr>
<w:r> <w:t>My Headl</w:t> </w:r>
<w:r> <w:t>ine</w:t> </w:r>
</w:p>
<w:p>
<w:r> <w:t>text 1.1.1 </w:t> </w:r>
<w:r> <w:t>text 1.1.2 </w:t> </w:r>
</w:p>
<w:p>
<w:r> <w:t>text 1.2.1 </w:t> </w:r>
<w:r> <w:t>text 1.2.2 </w:t> </w:r>
</w:p>
<w:p>
<w:pPr> <w:pStyle w:val="Heading1" /> </w:pPr>
<w:r> <w:t>My seco</w:t> </w:r>
<w:r> <w:t>nd Headline</w:t> </w:r>
</w:p>
<w:p>
<w:r> <w:t>text 2.1.1 </w:t> </w:r>
<w:r> <w:t>text 2.1.2 </w:t> </w:r>
</w:p>
<w:p>
<w:r> <w:t>text 2.2.1 </w:t> </w:r>
<w:r> <w:t>text 2.2.2 </w:t> </w:r>
</w:p>
</w:body>
Concatenating the content of a single paragraph is no problem, so it is simple to merge the data into a compact structure like the following:
<Document>
<Paragraphs>
<Headline>My Headline</Headline>
<Content>text 1.1.1 text 1.1.2 </Content>
<Content>text 1.2.1 text 1.2.2 </Content>
<Headline>My second Headline</Headline>
<Content>text 2.1.1 text 2.1.2 </Content>
<Content>text 2.2.1 text 2.2.2 </Content>
</Paragraphs>
</Document>
But this structure is not always useful, because it still does not have a single XML element that holds all the content belonging to one headline.
Does anyone know how to merge all paragraphs that lie between the w:p elements that represent headlines?
I would like an XSLT stylesheet that transforms the w:body content into a structure like this:
<Document>
<Paragraph>
<Headline>My Headline</Headline>
<Content>text 1.1.1 text 1.1.2 text 1.2.1 text 1.2.2 </Content>
</Paragraph>
<Paragraph>
<Headline>My second Headline</Headline>
<Content>text 2.1.1 text 2.1.2 text 2.2.1 text 2.2.2 </Content>
</Paragraph>
</Document>
What I have found so far:
If a w:p element contains a w:pPr element, then that w:pPr is always the first child node of the w:p element.
If a w:p element matches the condition ./w:pPr/w:pStyle[@w:val='Heading1'], then all w:r elements in that w:p element belong to the headline of the paragraph.
This might solve your problem. You need the xsl:for-each-group instruction, which is available in XSLT 2.0 and later. Select all w:p elements and declare that each group starts with a w:p in which the heading style is defined. Inside each group, the current-group() function returns the whole sequence of nodes in that group: the first item is the heading paragraph and the remaining items are the content paragraphs.
XSLT:
<xsl:stylesheet version="3.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:w="somenamespace">
<xsl:output method="xml" omit-xml-declaration="yes" />
<xsl:template match="w:body">
<Document>
<xsl:for-each-group select="w:p" group-starting-with="*[./w:pPr/w:pStyle[@w:val='Heading1']]">
<xsl:element name="Paragraph">
<xsl:element name="Headline">
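<!-- separator="" joins the split runs directly; the default separator for select is a single space -->
<xsl:value-of select="current-group()[1]/*/w:t/text()" separator="" />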
</xsl:element>
<xsl:element name="Content">
<xsl:for-each select="current-group()[position()>1]/*">
<xsl:copy-of select="./w:t/text()" />
</xsl:for-each>
</xsl:element>
</xsl:element>
</xsl:for-each-group>
</Document>
</xsl:template>
<xsl:template match="*|node()">
<xsl:apply-templates />
</xsl:template>
</xsl:stylesheet>
Output:
<Document xmlns:w="somenamespace">
<Paragraph>
<Headline>My Headline</Headline>
<Content>text 1.1.1 text 1.1.2 text 1.2.1 text 1.2.2 </Content>
</Paragraph>
<Paragraph>
<Headline>My second Headline</Headline>
<Content>text 2.1.1 text 2.1.2 text 2.2.1 text 2.2.2 </Content>
</Paragraph>
</Document>
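The stylesheet above relies on xsl:for-each-group, so it needs an XSLT 2.0 or 3.0 processor. If only an XSLT 1.0 processor is available (an assumption about your environment, not something stated in the question), the same grouping can probably be done with a key. The following is a minimal sketch, not a definitive implementation: the key name by-heading is made up for illustration, and the namespace URI is the same placeholder used in the sample input. Each non-heading w:p is indexed by the generated id of its nearest preceding heading w:p.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:w="somenamespace"
    exclude-result-prefixes="w">
  <xsl:output method="xml" omit-xml-declaration="yes" />

  <!-- Hypothetical key: index every non-heading w:p by the id of the
       closest preceding sibling w:p that carries the Heading1 style -->
  <xsl:key name="by-heading"
           match="w:p[not(w:pPr/w:pStyle/@w:val = 'Heading1')]"
           use="generate-id(preceding-sibling::w:p[w:pPr/w:pStyle/@w:val = 'Heading1'][1])" />

  <xsl:template match="w:body">
    <Document>
      <!-- One Paragraph element per heading paragraph -->
      <xsl:for-each select="w:p[w:pPr/w:pStyle/@w:val = 'Heading1']">
        <Paragraph>
          <Headline>
            <!-- Concatenate the split runs of the heading without separators -->
            <xsl:for-each select="w:r/w:t">
              <xsl:value-of select="." />
            </xsl:for-each>
          </Headline>
          <Content>
            <!-- All text of the paragraphs indexed under this heading -->
            <xsl:for-each select="key('by-heading', generate-id())/w:r/w:t">
              <xsl:value-of select="." />
            </xsl:for-each>
          </Content>
        </Paragraph>
      </xsl:for-each>
    </Document>
  </xsl:template>
</xsl:stylesheet>
```

One difference in behavior: paragraphs that appear before the first heading are indexed under an empty key value and are therefore left out of the result, whereas group-starting-with puts them into a leading group of their own. Setting exclude-result-prefixes="w" also keeps the xmlns:w declaration out of the output.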