c++, xml, xml-parsing, pugixml

Return node information during comparison


I've had a bit of help writing this code. At the moment it prints the ID numbers of the differences between the files, i.e. comparing new with old, which entries have been added, removed, or stayed the same.

However, what I want is to return the information in the node (i.e. title, location, date), not just the ID, when the entry appears only in new.xml.

My best guess from Googling is to use xpath->getAncestor, but I have no idea how to implement it.

My current code

#include <set>
#include <string>
#include <iostream>
#include <iterator>   // std::inserter
#include <algorithm>

#include "include/pugixml.hpp"

#define con(m) std::cout << m << '\n'
#define err(m) std::cerr << m << std::endl

using str_set = std::set<std::string>;

int main()
{
    pugi::xml_document doc;

    str_set a;
    doc.load_file("old.xml");

    // fill set a with just the ids from file a
    for(auto&& node: doc.child("site_entries").children("entry"))
        a.emplace(node.child("id").text().as_string());

    str_set b;
    doc.load_file("new.xml");

    // fill set b with just the ids from file b
    for(auto&& node: doc.child("site_entries").children("entry"))
        b.emplace(node.child("id").text().as_string());

    // now use the <algorithm> library

    str_set b_from_a;
    std::set_difference(a.begin(), a.end(), b.begin(), b.end()
        , std::inserter(b_from_a, b_from_a.begin()));

    str_set a_from_b;
    std::set_difference(b.begin(), b.end(), a.begin(), a.end()
        , std::inserter(a_from_b, a_from_b.begin()));

    str_set a_and_b;
    std::set_intersection(a.begin(), a.end(), b.begin(), b.end()
        , std::inserter(a_and_b, a_and_b.begin()));

    for(auto&& v: a)
        con("a       : " << v);

    con("");

    for(auto&& v: b)
        con("b       : " << v);

    con("");

    for(auto&& v: b_from_a)
        con("b_from_a: " << v);

    con("");

    for(auto&& v: a_from_b)
        con("a_from_b: " << v);

    con("");

    for(auto&& v: a_and_b)
        con("a_and_b : " << v);

    con("");
}

This is an example XML:

<?xml version="1.0" encoding="ISO-8859-1" ?>
<site_entries>
    <entry>
        <id><![CDATA[946757316]]></id>
        <url><![CDATA[http://www.site.co.uk/cgi-bin/tr.cgi?tid=752276]]></url>
        <content><![CDATA[Specialized Dolce Sport 27 Speed]]></content>
        <title><![CDATA[Bike]]></title>
        <price><![CDATA[£600]]></price>
        <date><![CDATA[01-AUG-13]]></date>
        <display_reference><![CDATA[214683-50142933_370647]]></display_reference>
        <location><![CDATA[City of London]]></location>
        <category><![CDATA[Bike]]></category>
    </entry>
    <entry>
        <id><![CDATA[90007316]]></id>
        <url><![CDATA[http://www.site.co.uk/cgi-bin/tr.cgi?tid=70952276]]></url>
        <content><![CDATA[Giant Sport Offroad Bike]]></content>
        <title><![CDATA[Bike]]></title>
        <price><![CDATA[£100]]></price>
        <date><![CDATA[11-AUG-15]]></date>
        <display_reference><![CDATA[2146433-50142933_370647]]></display_reference>
        <location><![CDATA[City of London]]></location>
        <category><![CDATA[Bike]]></category>
    </entry>
</site_entries>

I will have hundreds of thousands of results in total and tens of thousands of added entries, so I'm looking for the most efficient way to achieve this.


Solution

  • You can store the xml_node objects in a map: instead of std::set<std::string>, use std::map<std::string, pugi::xml_node>.

    It's possible/likely that using unordered_map will be faster for your case though. I would do something like this:

    #include "pugixml.hpp"
    
    #include <cstring>   // strcmp
    #include <iostream>
    #include <unordered_map>
    
    struct string_hasher
    {
        unsigned int operator()(const char* str) const
        {
            // Jenkins one-at-a-time hash (http://en.wikipedia.org/wiki/Jenkins_hash_function#one-at-a-time)
            unsigned int result = 0;
    
            while (*str)
            {
                result += static_cast<unsigned int>(*str++);
                result += result << 10;
                result ^= result >> 6;
            }
    
            result += result << 3;
            result ^= result >> 11;
            result += result << 15;
    
            return result;
        }
    
        bool operator()(const char* lhs, const char* rhs) const
        {
            return strcmp(lhs, rhs) == 0;
        }
    };
    
    typedef std::unordered_map<const char*, pugi::xml_node, string_hasher, string_hasher> xml_node_map;
    
    int main()
    {
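        // The map keys point into memory owned by the documents,
        // so the documents must outlive the maps.
        pugi::xml_document doca, docb;
        xml_node_map mapa, mapb;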
    
        if (!doca.load_file("a.xml") || !docb.load_file("b.xml"))
            return 1;
    
        for (auto& node: doca.child("site_entries").children("entry"))
            mapa[node.child_value("id")] = node;
    
        for (auto& node: docb.child("site_entries").children("entry"))
            mapb[node.child_value("id")] = node;
    
        for (auto& ea: mapa)
            if (mapb.count(ea.first) == 0)
            {
                std::cout << "Removed:" << std::endl;
                ea.second.print(std::cout);
            }
    
        for (auto& eb: mapb)
            if (mapa.count(eb.first) == 0)
            {
                std::cout << "Added:" << std::endl;
                eb.second.print(std::cout);
            }
    }
    

    Notable differences from your approach:

    • unordered_map lets you reduce the complexity of the diff: it's now O(N + M) on average, not O(N log N + M log M)
    • Custom hasher for C strings avoids allocating unnecessary memory
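
The custom hasher and comparison are needed because the standard library's default hash and equality for `const char*` work on the pointer value, not on the characters. Here is a minimal sketch of content-based lookup. The names `cstr_hash`, `cstr_equal`, `by_content` and `found_by_content` are made up for illustration; they play the same role as `string_hasher` above, split into two functors.

```cpp
#include <cstddef>
#include <cstring>
#include <string>
#include <unordered_map>

// Hypothetical helper: hashes the characters of a C string, not its address.
struct cstr_hash
{
    std::size_t operator()(const char* str) const
    {
        // Jenkins one-at-a-time hash
        unsigned int result = 0;
        while (*str)
        {
            result += static_cast<unsigned char>(*str++);
            result += result << 10;
            result ^= result >> 6;
        }
        result += result << 3;
        result ^= result >> 11;
        result += result << 15;
        return result;
    }
};

// Hypothetical helper: two keys are equal when their characters are equal.
struct cstr_equal
{
    bool operator()(const char* lhs, const char* rhs) const
    {
        return std::strcmp(lhs, rhs) == 0;
    }
};

using by_content = std::unordered_map<const char*, int, cstr_hash, cstr_equal>;

// Inserts `stored` as a key, then looks it up through a different buffer.
// Returns true when the lookup succeeds.
bool found_by_content(const std::string& stored, const std::string& probe)
{
    by_content m;
    m[stored.c_str()] = 1;  // the key points into `stored`, which must outlive `m`
    return m.count(probe.c_str()) == 1;
}
```

Two separate strings holding the same id are found as one key here, whereas a plain `std::unordered_map<const char*, int>` would treat them as different keys because their addresses differ. No memory is allocated for the keys, which is also why the buffers they point into (the pugixml documents in the answer's code) have to stay alive as long as the map does.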

    Of course, you can simplify this by using std::unordered_map<std::string, pugi::xml_node>; it's likely to be slower, but the code is shorter.
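
As a rough sketch of that shorter variant, the following uses `std::string` keys with the default hasher. So that it compiles without pugixml, a hypothetical `Entry` struct stands in for `pugi::xml_node`. `index_by_id` and `added_entries` are illustrative names too, not part of any library.

```cpp
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for pugi::xml_node; the fields mirror the example XML.
struct Entry
{
    std::string id;
    std::string title;
    std::string location;
    std::string date;
};

using entry_map = std::unordered_map<std::string, Entry>;

// Index the entries by id so membership checks are O(1) on average.
entry_map index_by_id(const std::vector<Entry>& entries)
{
    entry_map m;
    m.reserve(entries.size());
    for (const auto& e : entries)
        m.emplace(e.id, e);
    return m;
}

// Collect the entries whose id appears in `newer` but not in `older`.
std::vector<Entry> added_entries(const entry_map& older, const entry_map& newer)
{
    std::vector<Entry> added;
    for (const auto& kv : newer)
        if (older.count(kv.first) == 0)
            added.push_back(kv.second);
    return added;
}
```

With pugixml itself, the map value would be the `pugi::xml_node`. The fields the question asks for could then be read from each added node with calls such as `node.child_value("title")`, `node.child_value("location")` and `node.child_value("date")`.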