I decided to use the libxml2 parser for my Qt application and I'm stuck on XPath expressions. I found an example class with some methods and modified it a bit for my needs. The code:
QStringList* LibXml2Reader::XPathParsing(QXmlInputSource input)
{
xmlInitParser();
xmlDocPtr doc;
xmlXPathContextPtr xpathCtx;
xmlXPathObjectPtr xpathObj;
QStringList *valList =NULL;
QByteArray arr = input.data().toUtf8(); //convert input data to utf8
int length = arr.length();
const char* data = arr.data();
doc = xmlRecoverMemory(data,length); // build a tree, ignoring the errors
if(doc == NULL) { return NULL;}
xpathCtx = xmlXPathNewContext(doc);
if(xpathCtx == NULL)
{
xmlFreeDoc(doc);
xmlCleanupParser();
return NULL;
}
xpathObj = xmlXPathEvalExpression(BAD_CAST "//[@class='b-domik__nojs']", xpathCtx); // this is where the evaluation fails
if(xpathObj == NULL)
{
xmlXPathFreeContext(xpathCtx);
xmlFreeDoc(doc);
xmlCleanupParser();
return NULL;
}
xmlNodeSetPtr nodes = xpathObj->nodesetval;
int size = (nodes) ? nodes->nodeNr : 0;
if(size==0)
{
xmlXPathFreeObject(xpathObj); // free the result object too, otherwise it leaks on this path
xmlXPathFreeContext(xpathCtx);
xmlFreeDoc(doc);
xmlCleanupParser();
return NULL;
}
valList = new QStringList();
for (int i = 0; i < size; i++)
{
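xmlNodePtr current = nodes->nodeTab[i];
xmlChar* content = xmlNodeGetContent(current); // element nodes have content == NULL, so collect the text of the subtree instead
qDebug() << "name: " << QString::fromUtf8((const char*)current->name); // libxml2 strings are UTF-8
qDebug() << "content: " << QString::fromUtf8((const char*)content) << "\r\n";
valList->append(QString::fromUtf8((const char*)content));
xmlFree(content); // the string returned by xmlNodeGetContent must be freed by the caller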
}
xmlXPathFreeObject(xpathObj);
xmlXPathFreeContext(xpathCtx);
xmlFreeDoc(doc);
xmlCleanupParser();
return valList;
}
As an example, I'm making a request to http://yandex.ru/ and trying to get the node with class b-domik__nojs, which is basically a single div.
xpathObj = xmlXPathEvalExpression(BAD_CAST "//[@class='b-domik__nojs']", xpathCtx); // this is where the evaluation fails
The problem is that the expression //[@class='b-domik__nojs'] doesn't work at all. I checked it in the Firefox XPath extension and in the Opera developer tools XPath extension, and there the expression works perfectly.
I also tried to get other nodes by their attributes, but for some reason XPath fails for ANY attribute. Is there something wrong in my method? Also, when I load the tree using xmlRecoverMemory, it gives me a lot of parser errors in the debug output.
OK, I played with my libxml2 function a bit more and used the "//*" expression to get all elements in the document, but it returns only the elements inside the first child node of the body tag. In the yandex.ru DOM tree, it basically gets ALL the elements in the first div (<div class="b-line b-line_bar">), but for some reason doesn't look at the elements in the other child nodes of <body>.
Why can that happen? Maybe xmlParseMemory doesn't build a full tree for some reason? Is there any way to fix this?
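All right, it works now. My mistake was using the XML functions to build a tree from an HTML document. I used htmlReadMemory instead and the tree is now fully built. The expression also needed a node test: //*[@class='b-domik__nojs'] instead of //[@class='b-domik__nojs']. Some code again: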
xmlInitParser();
xmlDocPtr doc;
xmlXPathContextPtr xpathCtx;
xmlXPathObjectPtr xpathObj;
QByteArray arr = input.data().toUtf8();
int length = arr.length();
const char* data = arr.data();
doc = htmlReadMemory(data,length,"",NULL,HTML_PARSE_RECOVER);
if(doc == NULL) { return NULL;}
xpathCtx = xmlXPathNewContext(doc);
if(xpathCtx == NULL)
{
xmlFreeDoc(doc);
xmlCleanupParser();
return NULL;
}
xpathObj = xmlXPathEvalExpression(BAD_CAST "//*[@class='b-domik__nojs']", xpathCtx);
// ... the rest is the same as in the first version