
Xerces-C: Parse Javascript inside of HTML


I want to parse websites for their meta tags, and I use Xerces-C for this.

shared_ptr<SAX2XMLReader> parser(XMLReaderFactory::createXMLReader());

//Create and set callback handler with the given callback functions
Handler handler(startElement,endElement,characters);
parser->setContentHandler(&handler);
parser->setErrorHandler(&handler);

//Parse the file with the given callback handler
parser->parse(filename.c_str());

Some websites contain JavaScript. Inside the script tags, JavaScript uses the operator && for logical AND.

Xerces-C interprets each & as the start of an entity reference (for example &nbsp;) and throws an exception, because && is not a valid entity reference.

Is there a way to read this correctly as text?

If not, is there a way to simply ignore all characters inside script tags? I don't need them anyway. I just want to parse the meta tags.


Solution

  • HTML is not necessarily well-formed XML, so an XML parser such as Xerces-C can fail on it. You can, for instance, preprocess the page with tidy to convert it to well-formed markup before feeding it to the XML parser.
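
If running tidy is not an option, the fallback mentioned in the question (ignoring everything inside script tags) could be approximated by stripping the script elements from the text before the parser sees it. The following is only a minimal sketch of that idea, not a tidy replacement: stripScriptElements is a hypothetical helper, not part of Xerces-C, and it assumes the page is already loaded into a std::string and that each script element has an explicit closing tag.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical helper (not part of Xerces-C): returns a copy of `html`
// with every <script ...>...</script> element removed.
// Assumption: each script element is closed by an explicit </script> tag;
// an unterminated script element causes the rest of the input to be dropped.
std::string stripScriptElements(const std::string& html) {
    // Lowercased copy for case-insensitive tag matching. It has the same
    // length as `html`, so positions found in it are valid in `html` too.
    std::string lower(html);
    std::transform(lower.begin(), lower.end(), lower.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });

    std::string out;
    std::size_t pos = 0;
    while (true) {
        const std::size_t open = lower.find("<script", pos);
        if (open == std::string::npos) {
            out.append(html, pos, std::string::npos);  // no more scripts: keep the rest
            break;
        }
        out.append(html, pos, open - pos);             // keep text before the script

        const std::size_t close = lower.find("</script", open);
        if (close == std::string::npos) break;         // unterminated script
        const std::size_t end = lower.find('>', close);
        if (end == std::string::npos) break;           // malformed closing tag
        pos = end + 1;                                 // resume after </script>
    }
    return out;
}
```

The cleaned string could then be handed to the existing parser through a MemBufInputSource instead of a file name. This only removes the && problem; any other markup on the page that is not well-formed XML will still make Xerces-C fail, which is why preprocessing with tidy is the more robust route.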