c++ html domdocument xerces

C++ Xerces Parser Load HTML and Search for HTML Elements


I'm trying to load HTML with the Xerces-C++ DOM parser (DOMDocument) and search for specific HTML elements. I can't find good examples of how to do this; everything I find covers parsing XML. Can someone assist? Thanks.


Solution

  • Take a look at this: http://xerces.apache.org/xerces-c/program-dom-3.html

    There is an example with DOMDocument as well:

    // Create a small document tree

    {
        XMLCh tempStr[100];
    
        XMLString::transcode("Range", tempStr, 99);
        DOMImplementation* impl = DOMImplementationRegistry::getDOMImplementation(tempStr);
    
        XMLString::transcode("root", tempStr, 99);
        DOMDocument*   doc = impl->createDocument(0, tempStr, 0);
        DOMElement*   root = doc->getDocumentElement();
    
        XMLString::transcode("FirstElement", tempStr, 99);
        DOMElement*   e1 = doc->createElement(tempStr);
        root->appendChild(e1);
    
        XMLString::transcode("SecondElement", tempStr, 99);
        DOMElement*   e2 = doc->createElement(tempStr);
        root->appendChild(e2);
    
        XMLString::transcode("aTextNode", tempStr, 99);
        DOMText*       textNode = doc->createTextNode(tempStr);
        e1->appendChild(textNode);
    
        // optionally, call release() to release the resource associated with the range after done
        DOMRange* range = doc->createRange();
        range->release();
    
        // removeChild() returns a DOMNode*; the removed node is orphaned, optionally call release() to release associated resource
        DOMNode* removedNode = root->removeChild(e2);
        removedNode->release();
    
        // no need to release this returned object which is owned by implementation
        XMLString::transcode("*", tempStr, 99);
        DOMNodeList*    nodeList = doc->getElementsByTagName(tempStr);
    
        // done with the document, must call release() to release the entire document resources
        doc->release();
    }
    

    ... and so on.

    EDIT:

    But how do I load HTML into the DOMDocument and search the HTML elements? That's what I'm trying to figure out.

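    XercesDOMParser parser;
    // Preload the DTD used for validation ("grammar.dtd" is a placeholder path)
    parser.loadGrammar("grammar.dtd", Grammar::DTDGrammarType);
    parser.setValidationScheme(XercesDOMParser::Val_Always);
    // Handler is assumed to be your own class derived from ErrorHandler
    Handler handler;
    parser.setErrorHandler(&handler);
    // The input must be well-formed XML (for HTML, that means XHTML)
    parser.parse("xmlfile.xml");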