xsltproc html documents


I'm trying to clean up some HTML files. I converted them to XHTML with tidy:

$ tidy -asxml -i -w 150 -o o.xml index.html

The resulting XHTML contains named entities, and when I run xsltproc on it I keep getting errors:

$ xsltproc --novalid  -o out.htm  t.xsl o.xml
o.xml:873: parser error : Entity 'mdash' not defined
            resources to storing data and using permissions &mdash; as needed.</
                                                                   ^
o.xml:914: parser error : Entity 'uarr' not defined
        </div><a href="index.html#top" style="float:right">&uarr; Go to top</a>
                                                                 ^
o.xml:924: parser error : Entity 'nbsp' not defined
          Android 3.2&nbsp;r1 - 27 Jul 2011 12:18

If I add --html to xsltproc, it instead complains about a tag whose name and id attributes have the same value (which is valid):

$ xsltproc --novalid --html -o out.htm  t.xsl o.xml
o.xml:845: element a: validity error : ID top already defined
      <a name="top" id="top"></a>
                            ^

The xslt is simple:

<?xml version="1.0" encoding="ISO-8859-1"?>
<xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="html" indent="yes" omit-xml-declaration="yes"/>

    <xsl:template match="node()|@*">
      <xsl:copy>
         <xsl:apply-templates select="node()|@*"/>
      </xsl:copy>
    </xsl:template>

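    <!-- Drop the element whose id is the string 'side-nav'; unquoted, side-nav would be read as a child element name -->
    <xsl:template match="//*[@id='side-nav']"/>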
</xsl:stylesheet>

Why doesn't --html work? Why is it complaining? Or should I forget it and fix the entities?


Solution

  • I assume the question is this: I know how to avoid the "Entity 'XXX' not defined" errors when running xsltproc (add --html), but how do I get rid of "ID YYY already defined"?

    Recent builds of Tidy have an anchor-as-name option. You can set it to "no" to remove unwanted name attributes:

    This option controls the deletion or addition of the name attribute in elements where it can serve as anchor. If set to "yes", a name attribute, if not already existing, is added along an existing id attribute if the DTD allows it. If set to "no", any existing name attribute is removed if an id attribute exists or has been added.
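
As a rough illustration of the option described above, the following sketch runs Tidy with --anchor-as-name no on a throwaway sample page and prints the resulting anchor. The sample markup, the temporary directory and the -q (quiet) flag are assumptions made for the example; the remaining flags are the ones from the question. It has not been checked against every Tidy build, so treat it as a starting point rather than a guaranteed fix.

```shell
#!/bin/sh
# Sketch: show the effect of --anchor-as-name no on a throwaway sample page.
# The sample markup and the temporary directory are illustrative assumptions;
# the tidy flags other than --anchor-as-name and -q are the ones from the question.

work=$(mktemp -d) || exit 1

# Minimal stand-in for index.html with an anchor carrying both name and id.
cat > "$work/index.html" <<'EOF'
<html>
<head><title>sample</title></head>
<body><a name="top" id="top">Top</a><p>Some text.</p></body>
</html>
EOF

if command -v tidy >/dev/null 2>&1; then
    # Tidy exits with 1 when it only reports warnings, so only 2 (errors) is fatal here.
    tidy -q -asxml -i -w 150 --anchor-as-name no -o "$work/o.xml" "$work/index.html"
    [ $? -le 1 ] || exit 1

    # Show the anchor as Tidy wrote it; the name attribute should be gone.
    grep '<a ' "$work/o.xml"
else
    echo "tidy is not installed; skipping" >&2
fi
```

If the name attribute is indeed removed, the xsltproc --html command from the question can be retried against the regenerated o.xml.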