XHTML to Structured XML with XSLT 1.0


I have an XHTML document from a basic ePub export that I'm trying to convert into a structured XML document. The format is fairly simple and looks like the following:

<?xml version="1.0" encoding="utf-8"?>
<html>
<body>
  <h1>Topic 1</h1>
  <p>1.0.1</p>
  <p>1.0.2</p>

  <h2>Subtopic 1.1</h2>
  <p>1.1.1</p>
  <p>1.1.2</p>

  <h2>Subtopic 1.2</h2>
  <p>1.2.1</p>
  <p>1.2.2</p>

  <h1>Topic 2</h1>
  <p>2.0.1</p>
  <p>2.0.2</p>

  <h2>Subtopic 2.1</h2>
  <p>2.1.1</p>
  <p>2.1.2</p>

  <h2>Subtopic 2.2</h2>
  <p>2.2.1</p>
  <p>2.2.2</p>
</body>
</html>

Ideally, I'd like to convert this into a nested structure based on the h1, h2, ... tags. Everything from the first h1 up to (but not including) the second h1 should go inside its own container, and everything from the second h1 to the end of the document inside another. Likewise, the content between h2's should go into its own container, nested inside the container of the enclosing h1. The output should be something like this:

<Root>
   <Topic>
      <Title>Topic 1</Title>
      <Paragraph>1.0.1</Paragraph>
      <Paragraph>1.0.2</Paragraph>
      <Topic>
         <Title>Subtopic 1.1</Title>
         <Paragraph>1.1.1</Paragraph>
         <Paragraph>1.1.2</Paragraph>
      </Topic>
      <Topic>
         <Title>Subtopic 1.2</Title>
         <Paragraph>1.2.1</Paragraph>
         <Paragraph>1.2.2</Paragraph>
      </Topic>
   </Topic>
   <Topic>
      <Title>Topic 2</Title>
      <Paragraph>2.0.1</Paragraph>
      <Paragraph>2.0.2</Paragraph>
      <Topic>
         <Title>Subtopic 2.1</Title>
         <Paragraph>2.1.1</Paragraph>
         <Paragraph>2.1.2</Paragraph>
      </Topic>
      <Topic>
         <Title>Subtopic 2.2</Title>
         <Paragraph>2.2.1</Paragraph>
         <Paragraph>2.2.2</Paragraph>
      </Topic>
   </Topic>
</Root>

Although the example contains only p tags, the real content may also include div's and other elements, so don't count on a single element type. The solution needs to be generic enough not to care what sits between the heading tags.

I'm familiar with Muenchian grouping, but this situation is a bit too complex for me. I've tried using keys like this:

<xsl:key name="kHeaders1" match="*[not(self::h1)]" use="generate-id(preceding-sibling::h1[1])"/>

<xsl:template match="h1">
  <Topic>
    <Title><xsl:apply-templates /></Title>
    <xsl:apply-templates select="key('kHeaders1', generate-id())" />
  </Topic>
</xsl:template>

<xsl:template match="html">
  <Root>
     <xsl:apply-templates select="body/h1" />
  </Root>
</xsl:template>

<xsl:template match="p">
   <Paragraph><xsl:apply-templates /></Paragraph>
</xsl:template>

This works well enough for the first level, but repeating the process for h2 is where I get stuck. At the h2 level, the key for any node should be its nearest preceding h1 or h2 sibling. It almost seems like this could be combined into a single key, where the id is that of whichever h* element most recently preceded the node, and where the h* elements themselves are excluded from the grouping (so that they don't recurse). I would imagine something like:

<xsl:key name="kHeaders" match="*[not(self::h1 or self::h2)]" use="generate-id(preceding-sibling::*[self::h1 or self::h2][1])"/>

However, that leaves the h2 elements out of the key entirely, and they need to be present in the group for the preceding h1. And if I relax the match to include the h1/h2 elements (and make the h1 template also match h2), then the h2's end up re-listing the h1's, and so on (somewhat expected).

An ideal solution is one that can be extended to h3, h4, and so on without much effort. It does not need to include script elements for handling h* elements generically, though; simple instructions for adding another level would be sufficient.

Does anyone have some advice here?


Solution

  • The stylesheet below (most of the essential code is copied from this answer) also works when more heading levels are involved:

    <xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
        <xsl:output method="xml" indent="yes"/>
        <xsl:strip-space elements="*"/>
    
        <!-- Sub-headings: each hN (N = 2..6) is indexed under the id of its
             nearest preceding sibling heading of a higher level. -->
        <xsl:key name="next-headings" match="h6"
              use="generate-id(preceding-sibling::*[self::h1 or self::h2 or
                                                   self::h3 or self::h4 or
                                                   self::h5][1])" />
    
        <xsl:key name="next-headings" match="h5"
              use="generate-id(preceding-sibling::*[self::h1 or self::h2 or
                                                   self::h3 or self::h4][1])" />
    
        <xsl:key name="next-headings" match="h4"
              use="generate-id(preceding-sibling::*[self::h1 or self::h2 or
                                                   self::h3][1])" />
    
        <xsl:key name="next-headings" match="h3"
              use="generate-id(preceding-sibling::*[self::h1 or self::h2][1])" />
    
        <xsl:key name="next-headings" match="h2"
              use="generate-id(preceding-sibling::h1[1])" />
    
        <!-- Content: every non-heading node is indexed under the id of its
             nearest preceding sibling heading of any level. -->
        <xsl:key name="immediate-nodes"
              match="node()[not(self::h1 | self::h2 | self::h3 | self::h4 |
                               self::h5 | self::h6)]"
              use="generate-id(preceding-sibling::*[self::h1 or self::h2 or
                                                   self::h3 or self::h4 or
                                                   self::h5 or self::h6][1])" />
    
        <xsl:template match="/">
            <Root>
                <xsl:apply-templates select="html/body/h1"/>
            </Root>
        </xsl:template>
    
        <xsl:template match="p">
            <Paragraph>
                <xsl:value-of select="."/>
            </Paragraph>
        </xsl:template>
    
        <xsl:template match="h1 | h2 | h3 | h4 | h5 | h6">
            <Topic>
                <Title>
                    <xsl:value-of select="."/>
                </Title>
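                <!-- First the content that belongs directly to this heading... -->
                <xsl:apply-templates select="key('immediate-nodes', generate-id())"/>
                <!-- ...then its nested sub-headings, which recurse into this template. -->
                <xsl:apply-templates select="key('next-headings', generate-id())"/>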
            </Topic>
        </xsl:template>
    
    </xsl:stylesheet>
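
Since the question asks how to add another level, a trimmed-down variant may make the pattern easier to see. The sketch below is not taken from the linked answer: it handles only h1 to h3, uses xsl:apply-templates inside Paragraph (as in the question's own attempt) so inline markup is not flattened, and adds a catch-all template for div and other elements. That catch-all is an assumption about what the output should look like for non-p content, which the question does not specify.

```xml
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="xml" indent="yes"/>
    <xsl:strip-space elements="*"/>

    <!-- Sub-headings, indexed by the nearest preceding higher-level heading.
         To add a level hN: add one more key like these, matching hN and
         listing h1 .. h(N-1) in the predicate. -->
    <xsl:key name="next-headings" match="h2"
          use="generate-id(preceding-sibling::h1[1])"/>
    <xsl:key name="next-headings" match="h3"
          use="generate-id(preceding-sibling::*[self::h1 or self::h2][1])"/>

    <!-- Non-heading content, indexed by the nearest preceding heading.
         To add a level hN: add self::hN to both predicates. -->
    <xsl:key name="immediate-nodes"
          match="node()[not(self::h1 or self::h2 or self::h3)]"
          use="generate-id(preceding-sibling::*[self::h1 or self::h2 or self::h3][1])"/>

    <xsl:template match="/">
        <Root>
            <xsl:apply-templates select="html/body/h1"/>
        </Root>
    </xsl:template>

    <!-- To add a level hN: add hN to this match pattern. -->
    <xsl:template match="h1 | h2 | h3">
        <Topic>
            <Title><xsl:value-of select="."/></Title>
            <xsl:apply-templates select="key('immediate-nodes', generate-id())"/>
            <xsl:apply-templates select="key('next-headings', generate-id())"/>
        </Topic>
    </xsl:template>

    <xsl:template match="p">
        <Paragraph><xsl:apply-templates/></Paragraph>
    </xsl:template>

    <!-- Assumed fallback (not in the answer above): copy any other element,
         such as div, with its attributes, and keep processing its children. -->
    <xsl:template match="*">
        <xsl:copy>
            <xsl:copy-of select="@*"/>
            <xsl:apply-templates/>
        </xsl:copy>
    </xsl:template>
</xsl:stylesheet>
```

For the sample input in the question, this should yield the nested Topic structure shown there. Two caveats: anything that appears before the first h1 is dropped, because the root template selects only html/body/h1; and the patterns assume the elements are in no namespace, as in the sample. If the real ePub file declares xmlns="http://www.w3.org/1999/xhtml", the element names in the keys and match patterns would need a prefix bound to that namespace.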