c++, xml, xml-parsing, tinyxml

Blindly parse XML page for specific tags


I'm having trouble using TinyXML2 to blindly parse an XML page for specific tags.

Basically, I have been asked to parse an HTML page in C++. I first use the (quite old) tidyHTML library to "translate" my HTML pages into XML. Then I want to use TinyXML2 to search the resulting XML for the content of specific tags (title, h1, meta keywords, ...).

To this end, I am trying to loop through all the tags in my XML page, using this code:

XMLDocument doc;
doc.Parse( cleanedHTML.c_str() );
XMLNode* currentNode;

if(currentNode->NoChildren())
{
    while(!currentNode->NextSibling())
    {
        currentNode=currentNode->Parent();
        if(!currentNode)
            return NULL;
    }
    currentNode=currentNode->NextSibling();
}
else
{
    currentNode=currentNode->FirstChild();
}

doc.Print();
std::string nodeName = currentNode->LastChild()->Value();
return nodeName;

There are probably a few things wrong with this code - no doubt, I'm clearly an amateur. But the result still puzzles me: nodeName comes back as "USER=root" no matter which page I parse.

I tried selecting this node's related elements, such as currentNode->FirstChildElement() or LastChildElement(), or even its siblings, but every time it results in a segmentation fault that I cannot explain.

I've read that XPath would be a good way to do what I'm trying to do, but I'm running out of time and I fear I wouldn't be able to wrap my mind around XPath on such short notice.

I am probably looking at all this the wrong way, or maybe I should be using Accept()?
I honestly feel a bit lost here and would appreciate any help you guys would be so kind as to offer!
I'd also like to take this chance to thank this website, which has helped me so much in the past. Truly amazing.

Thanks in advance for your responses!


Solution

  • Now that I've finished my project I can finally answer this:

    What I was looking for was indeed Accept() and visitors. I had to write a visitor class, define what it should do for each node it encounters, and pass an instance of it to doc.Accept().

    For instance, to get the parsed page's title into a string, I would do this:

    bool MyVisitor::VisitEnter(const XMLElement& element, const XMLAttribute* attribute)
    {
        if (strcmp(element.Name(), "title") == 0)
        {
            // GetText() returns NULL when the element has no text.
            if (element.GetText() != NULL)
                titleContent = element.GetText();
            else
                titleContent = "";
        }
        return true; // keep visiting the rest of the document
    }

    ... and then return it with a classic MyVisitor::getTitle() function that you'd call wherever you'd need it.
    Hope it helps, if anyone wants more details I can provide working & extended code.

    I've since discovered that Google released gumbo parser so... yeah.
    It's apparently both better & easier than using TinyXML-2 for parsing HTML5 nowadays :D