WebHarvest XML not well formed


I am using WebHarvest to try to retrieve data from Woot.com and I'm getting a few different errors. I am able to get the website with the first process, but when I try to test an XPath inside the variable window I get the error org.xml.sax.SAXParseException; lineNumber: 86; columnNumber: 99; The reference to entity "pt2" must end with the ';' delimiter. If I try to use the pretty-print function it returns XML is not well-formed: the reference to entity "pt2" must end with the ';' delimiter. {line: 86, col:99]. Lastly, inside the script I am writing, if I put in the xpath tag with an expression, I get Element type "xpath" must be followed by either attribute specifications, ">" or "/>". Can someone tell me what I am doing wrong? I am very new to WebHarvest and don't have any experience with this kind of program.

My code is:

<?xml version="1.0" encoding="UTF-8"?><config>
<xpath expression="(//div[@class="overview"])[1]//h2/text()">
<html-to-xml>
<http url="http://www.woot.com/"/>
</html-to-xml>
</xpath>
</config>

Solution

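  • To make the XML well-formed, you have to use single quotes (') instead of double quotes (") inside the value of the expression attribute, because the attribute itself is delimited by double quotes. Here is the corrected config: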

    <?xml version="1.0" encoding="UTF-8"?><config>
    <xpath expression="(//div[@class='overview'])[1]//h2/text()">
    <html-to-xml>
    <http url="http://www.woot.com/"/>
    </html-to-xml>
    </xpath>
    </config>
    

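    You can use either single quotes (') or double quotes (") to delimit an attribute value, but the delimiter character cannot appear unescaped inside that value; write it as &apos; or &quot; instead. Here are a few examples: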

     <xpath expression='(//div[@class="overview"])[1]//h2/text()'>           --- valid
     <xpath expression='(//div[@class='overview'])[1]//h2/text()'>           --- invalid
     <xpath expression="(//div[@class="overview"])[1]//h2/text()">           --- invalid
     <xpath expression='(//div[@class=&apos;overview&apos;])[1]//h2/text()'> --- valid
     <xpath expression="(//div[@class=&apos;overview&apos;])[1]//h2/text()"> --- valid
     <xpath expression="(//div[@class=&quot;overview&quot;])[1]//h2/text()"> --- valid
    

    Hope this helps.
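
If you want to double-check the quoting variants listed above outside of WebHarvest, here is a minimal sketch in Python using the standard library's XML parser. This is not part of WebHarvest: the helper names `is_well_formed` and `expression_of` are made up for illustration, and each variant is closed as an empty element (`/>`) so it can be parsed on its own.

```python
import xml.etree.ElementTree as ET

# The six quoting variants from the answer, in the same order,
# each turned into a self-closing element so it parses standalone.
VARIANTS = [
    """<xpath expression='(//div[@class="overview"])[1]//h2/text()'/>""",
    """<xpath expression='(//div[@class='overview'])[1]//h2/text()'/>""",
    """<xpath expression="(//div[@class="overview"])[1]//h2/text()"/>""",
    """<xpath expression='(//div[@class=&apos;overview&apos;])[1]//h2/text()'/>""",
    """<xpath expression="(//div[@class=&apos;overview&apos;])[1]//h2/text()"/>""",
    """<xpath expression="(//div[@class=&quot;overview&quot;])[1]//h2/text()"/>""",
]


def is_well_formed(fragment: str) -> bool:
    """Return True if the fragment parses as well-formed XML."""
    try:
        ET.fromstring(fragment)
        return True
    except ET.ParseError:
        return False


def expression_of(fragment: str) -> str:
    """Return the decoded value of the expression attribute."""
    return ET.fromstring(fragment).get("expression")


if __name__ == "__main__":
    for fragment in VARIANTS:
        verdict = "valid" if is_well_formed(fragment) else "invalid"
        print(f"{verdict:8} {fragment}")
```

Note that the parser decodes &apos; and &quot; before the attribute value is handed on, so the escaped variants carry the same XPath expression as the ones that simply mix quote styles.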