c++ boost-spirit boost-graph boost-spirit-x3

Efficiently parse trivial files with boost spirit X3


I am a novice with C++ and Boost Spirit X3. For my project I use Boost Spirit X3 to parse a geo-social graph from two files, with the structure shown below, into a Boost graph.

I have a working implementation. As I have no prior experience with these libraries, I wonder what you think about the approach and whether you would recommend a different one.

The graph file has one line per edge. While parsing the edges I have to create the nodes of the graph if they have not been seen before. I use a semantic action that checks, every time it encounters a node id, whether that node is already in the graph. Once a full line has been read, a second semantic action adds the edge.

The location file has one line per known location of a node at a given time. I store only the first known location for each node in the graph (using a custom Boost graph property).

I have two concrete questions but would be happy to receive any thoughts and suggestions:

  • Is it ok to use nested semantic actions as I do for the graph file? Does this hurt performance?
  • Is it recommended to parse the whole file at once with Spirit X3 or should I parse every line individually with Spirit X3?

Graph (denoting the edges in the graph)

[user1]     [user2]
0           3

Locations

[user]  [check-in time]         [latitude]      [longitude]     [location id]
0       2010-10-19T23:55:27Z    30.2359091167   -97.7951395833      22847

Spirit X3 parsing code

// Parse the gowalla edge file
boost::spirit::istream_iterator file_iterator(edge_file), eof;

x3::phrase_parse(file_iterator, eof,
        // Begin grammar
        (
         *((x3::int_[add_vertex] >> x3::int_[add_vertex])[add_edge])
        ),
        // End grammar
        x3::space
        );

// Fail if we couldn't parse the whole edges file
if (file_iterator != eof) {
    std::cerr << "Couldn't parse whole edges file" << std::endl;
}

// Parse the gowalla location file
file_iterator = boost::spirit::istream_iterator(location_file);

x3::phrase_parse(file_iterator, eof,
        // Begin grammar
        (
         // vertex_id   time of checkin       latitude  longitude             location id
         *((x3::int_ >> x3::lexeme[*x3::graph] >> x3::double_ >> x3::double_)[add_location] >> x3::int_ >> x3::eol)
        ),
        // End grammar
        x3::blank
        );

// Fail if we couldn't parse the whole location file
if (file_iterator != eof) {
    std::cerr << "Couldn't parse whole location file" << std::endl;
}

Semantic actions called by X3

// Lambda function that adds vertex to graph if not already added
auto add_vertex = [&](auto& ctx){
    // Return if the vertex is already known
    // (plain return: the other path returns nothing, so the lambda must be void)
    if (vertices.find(x3::_attr(ctx)) != vertices.end()) {
        return;
    }

    // Otherwise add vertex to graph
    auto v = boost::add_vertex(g);

    // And add vertex descriptor to map
    vertices[x3::_attr(ctx)] = v;
};

// Lambda function that adds edge to graph
auto add_edge = [&](auto& ctx){
    // _attr(ctx) returns a boost fusion tuple
    auto attr = x3::_attr(ctx);

    // Add edge from the vertices returned from context
    boost::add_edge(vertices[fusion::at_c<0>(attr)],
            vertices[fusion::at_c<1>(attr)], g);
};

// Lambda function that adds locations to vertices in the graph
auto add_location = [&](auto& ctx){
    // _attr(ctx) returns a boost fusion tuple
    auto attr = x3::_attr(ctx);
    auto vertex_id = fusion::at_c<0>(attr);

    if (location_already_added.find(vertex_id) != location_already_added.end()) {
        // Exit, as we already stored the location for this vertex
        return true;
    }
    location_already_added.insert(vertex_id);

    // Test if vertex is in our graph
    // We are parsing locations from a different file than the graph,
    // so there might be inconsistencies
    if (vertices.find(vertex_id) == vertices.end()) {
        std::cerr << "Tried to add location to vertex " << vertex_id << ", but this vertex is not in our graph" << std::endl;
        return false;
    }

    auto vertex = vertices[vertex_id];

    // Add location to the vertex
    g[vertex].latitude = fusion::at_c<2>(attr);
    g[vertex].longitude = fusion::at_c<3>(attr);

    return true;
};

Boost graph

struct vertex_property {
    double longitude;
    double latitude;
};

// Define our graph
// We use setS to enforce our graph not to become a multigraph
typedef boost::adjacency_list<boost::setS, boost::vecS, boost::undirectedS, vertex_property, edge_property > graph;

Solution

  • Q. Is it ok to use nested semantic actions as I do for the graph file? Does this hurt performance?

    I wouldn't do it. It's probably much easier to just add the edges wholesale:

    x3::parse(file_iterator, eof,
            *((x3::int_ >> '\t' >> x3::int_ >> x3::eol)[add_edge])
            );
    

    Where add_edge could be as simple as:

    auto add_edge = [&](auto& ctx){
        // Add edge from the parsed attribute
        vertex_decriptor source, target;
        auto tup = std::tie(source, target);
    
        fusion::copy(x3::_attr(ctx), tup);
    
        boost::add_edge(map_vertex(source), map_vertex(target), g);
    };
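
The fragment above calls map_vertex without showing it. Here is a minimal sketch of the lookup-or-insert idea using only the standard library. TinyGraph and VertexMapper are hypothetical stand-ins for the Boost graph and the reader class, and std::size_t stands in for vertex_descriptor. It assumes C++17 and uses try_emplace, so each id costs one map lookup instead of a find followed by operator[].

```cpp
#include <cassert>
#include <cstddef>
#include <map>

// Hypothetical stand-in for the Boost graph: it only counts vertices and
// hands out 0, 1, 2, ... as descriptors, like a vecS vertex container.
struct TinyGraph {
    std::size_t vertex_count = 0;
    std::size_t add_vertex() { return vertex_count++; }
};

// Hypothetical helper that translates ids from the file into descriptors.
struct VertexMapper {
    TinyGraph g;
    std::map<int, std::size_t> mapped_vertices;

    std::size_t map_vertex(int id) {
        // One lookup: try_emplace inserts a placeholder only if id is new.
        auto [pos, inserted] = mapped_vertices.try_emplace(id, 0);
        if (inserted)
            pos->second = g.add_vertex(); // first sighting: create the vertex

        // Invariant: exactly one graph vertex per mapped id.
        assert(mapped_vertices.size() == g.vertex_count);
        return pos->second;
    }
};
```

Calling map_vertex(42), map_vertex(7) and map_vertex(42) in that order yields 0, 1 and 0, and leaves two vertices in the stand-in graph.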
    

    Q. Is it recommended to parse the whole file at once with Spirit X3 or should I parse every line individually with Spirit X3?

    I don't think Spirit makes any recommendation. I'd parse the whole file at once, and I recommend using memory-mapped files for more efficiency (random-access iteration without multi_pass iterator adaptation).

    General Remarks:

    1. you are using whitespace-sensitive parsers with istream_iterators. Remember to reset the skipws flag on the stream, otherwise the stream drops the whitespace before the parser ever sees it.

    2. the vertices map seems like a waste of resources; consider whether you can use the [user] thing (vertex_id) directly instead of translating to vertex_descriptor.
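
The first remark can be seen with the standard std::istream_iterator<char>. The sketch below is not Spirit code: read_chars is a hypothetical helper, and I am assuming boost::spirit::istream_iterator honours the stream's skipws flag in the same way, which is what the remark relies on.

```cpp
#include <cassert>
#include <iterator>
#include <sstream>
#include <string>

// Hypothetical helper: returns every character an istream_iterator<char>
// yields for the given text, with or without the skipws flag.
std::string read_chars(std::string const& text, bool keep_whitespace) {
    std::istringstream in(text);
    if (keep_whitespace)
        in >> std::noskipws; // same effect as in.unsetf(std::ios::skipws)

    // Sanity check: skipws is set by default and cleared only on request.
    bool const skipping = (in.flags() & std::ios::skipws) != 0;
    assert(skipping == !keep_whitespace);

    std::istream_iterator<char> first(in), last;
    return std::string(first, last);
}
```

For the sample edge line "0\t3\n", the default flags deliver only "03", so a grammar that expects '\t' or x3::eol can never match. With noskipws the tab and newline come through unchanged.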

    Here's a cleaned up version that parses the files from https://snap.stanford.edu/data/loc-gowalla.html just fine in about 19s (that's considerably faster already):


    #include <boost/fusion/adapted/std_tuple.hpp>
    #include <boost/graph/adjacency_list.hpp>
    #include <boost/spirit/home/x3.hpp>
    #include <boost/spirit/include/support_istream_iterator.hpp>
    #include <fstream>
    #include <iostream>
    #include <map>
    #include <set>
    #include <string>
    #include <tuple>
    
    namespace x3 = boost::spirit::x3;
    namespace fusion = boost::fusion;
    
    struct vertex_property {
        double longitude;
        double latitude;
    };
    
    struct edge_property { };
    
    struct Reader {
        bool read_edges(std::string fname) {
            // Lambda function that adds edge to graph
            auto add_edge = [this](auto& ctx){
                // Add edge from the parsed attribute
                vertex_decriptor source, target;
                auto tup = std::tie(source, target);
    
                fusion::copy(x3::_attr(ctx), tup);
    
                boost::add_edge(this->map_vertex(source), this->map_vertex(target), g);
            };
    
            // Parse the gowalla edge file
            std::ifstream edge_file(fname);
            if (!edge_file) return false;
    
            boost::spirit::istream_iterator file_iterator(edge_file >> std::noskipws), eof;
    
            x3::parse(file_iterator, eof, *((x3::int_ >> '\t' >> x3::int_ >> x3::eol)[add_edge]));
    
            // Fail if we couldn't parse the whole edges file
            return (file_iterator == eof);
        }
    
        bool read_locations(std::string fname) {
            // Lambda function that adds locations to vertices in the graph
            auto add_location = [&](auto& ctx){
                // _attr(ctx) returns a boost fusion tuple
                auto attr = x3::_attr(ctx);
                auto vertex_id = fusion::at_c<0>(attr);
    
                if (!location_already_added.insert(vertex_id).second)
                    return true; // Exit, as we already stored the location for this vertex
    
                // Test if vertex is in our graph
                // We are parsing locations from a different file than the graph, so
                // there might be inconsistencies
                auto mapped = mapped_vertices.find(vertex_id);
                if (mapped == mapped_vertices.end()) {
                    std::cerr << "Tried to add location to vertex " << vertex_id << ", but this vertex is not in our graph" << std::endl;
                    return false;
                }
    
                // Add location to the vertex
                auto& props = g[mapped->second];
                props.latitude  = fusion::at_c<1>(attr);
                props.longitude = fusion::at_c<2>(attr);
    
                return true;
            };
    
            // Parse the gowalla location file
            std::ifstream location_file(fname);
            if (!location_file) return false;
    
            boost::spirit::istream_iterator file_iterator(location_file >> std::noskipws), eof;
    
            x3::parse(file_iterator, eof,
                    // [vertex_id]   [time of checkin]       [latitude]  [longitude]             [location] id
                    *((x3::int_ >> '\t' >> x3::omit[*x3::graph] >> '\t' >> x3::double_ >> '\t' >> x3::double_)[add_location] >> '\t' >> x3::int_ >> x3::eol)
                    );
    
            // Fail if we couldn't parse the whole location file
            return (file_iterator == eof);
        }
    
      private:
        // We use setS to enforce our graph not to become a multigraph
        typedef boost::adjacency_list<boost::setS, boost::vecS, boost::undirectedS, vertex_property, edge_property> graph;
        using vertex_decriptor = graph::vertex_descriptor;
    
        std::map<int, vertex_decriptor> mapped_vertices;
        std::set<int> location_already_added;
        graph g;
    
        // Returns the vertex for id, adding it to the graph if not already added
        vertex_decriptor map_vertex(int id) {
            auto match = mapped_vertices.find(id);
    
            if (match != mapped_vertices.end())
                return match->second; // vertex already known
            else                      // Otherwise add vertex
                return mapped_vertices[id] = boost::add_vertex(g);
        };
    };
    
    int main() {
        Reader reader;
        if (!reader.read_edges("loc-gowalla_edges.txt"))
            std::cerr << "Couldn't parse whole edges file" << std::endl;
    
        if (!reader.read_locations("loc-gowalla_totalCheckins.txt"))
            std::cerr << "Couldn't parse whole location file" << std::endl;
    }
    

    Mapped Files

    For comparison, replacing the stream iterators with memory-mapped files makes it MUCH faster: it completes in 3s (that's over 6x faster again):


    Example changed fragment:

        boost::iostreams::mapped_file_source mm(fname);
        auto f = mm.begin(), l = mm.end();
        x3::parse(f, l, *((x3::int_ >> '\t' >> x3::int_ >> x3::eol)[add_edge]));
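
To show why a contiguous buffer helps, here is a rough standard-library sketch of the same record loop over a [f, l) character range. It is not the X3 grammar: parse_edges is a hypothetical helper, it assumes C++17 (std::from_chars), and it assumes every record ends in a plain '\n', whereas x3::eol should also accept "\r\n".

```cpp
#include <cassert>
#include <charconv>
#include <string_view>
#include <system_error>
#include <utility>
#include <vector>

// Hypothetical helper: parses "source<TAB>target<LF>" records from a
// contiguous character range, such as the bytes of a mapped file.
// Returns true only if the whole buffer was consumed (the f == l check).
bool parse_edges(std::string_view buf, std::vector<std::pair<int, int>>& edges) {
    char const* f = buf.data();
    char const* const l = f + buf.size();

    while (f != l) {
        int source = 0, target = 0;

        // First integer, which must be followed by a tab.
        auto r1 = std::from_chars(f, l, source);
        if (r1.ec != std::errc{} || r1.ptr == l || *r1.ptr != '\t')
            return false;

        // Second integer, which must be followed by a newline.
        auto r2 = std::from_chars(r1.ptr + 1, l, target);
        if (r2.ec != std::errc{} || r2.ptr == l || *r2.ptr != '\n')
            return false;

        edges.emplace_back(source, target);

        assert(r2.ptr < l); // guaranteed by the check above
        f = r2.ptr + 1;     // step over the newline
    }
    return true;
}
```

Edges seen before a malformed record stay in the vector, much like semantic actions that have already fired by the time X3 stops. No stream or buffering layer sits between the parser and the bytes, which is where I'd expect most of the speedup to come from.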
    

    Memory overhead

    After profiling, it looks like having the map/set is probably not too bad.

    From what I see, the program uses 152 MiB, of which only 4.1 MiB show up as location_already_added at first glance.

    Reducing Memory Usage And Time

    Even so, replacing the set<int> location_already_added with a dynamic bitset and removing the map<int, vertex_descriptor> does further reduce memory usage as well as program run time.

    This time it completes in under 2s (another 33% off).

    It takes roughly 10% less memory for obvious reasons: 138.7 MiB.
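
The first-location-only bookkeeping can be sketched with one bit per vertex id. This uses std::vector<bool> as a stand-in for boost::dynamic_bitset, and FirstSeen is a hypothetical name. Unlike the listing below, which sizes the bitset once from num_vertices(g), this sketch grows on demand, so an id missing from the edge file cannot index out of range.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: remembers which ids have been seen, one bit each.
class FirstSeen {
  public:
    explicit FirstSeen(std::size_t expected_ids) : seen_(expected_ids, false) {}

    // Returns true the first time an id is marked, false on every repeat.
    bool mark(std::size_t id) {
        if (id >= seen_.size())
            seen_.resize(id + 1, false); // grow for ids beyond the estimate
        assert(id < seen_.size());

        if (seen_[id])
            return false; // location already stored for this id
        seen_[id] = true;
        return true;
    }

  private:
    std::vector<bool> seen_;
};
```

In add_location this would replace the test/set pair: store the coordinates only when mark(vertex_id) returns true. It needs non-negative ids that are reasonably dense, which holds when the ids double as vertex descriptors.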


    Changes:

    #include <boost/fusion/adapted/std_tuple.hpp>
    #include <boost/graph/adjacency_list.hpp>
    #include <boost/spirit/home/x3.hpp>
    #include <boost/iostreams/device/mapped_file.hpp>
    #include <boost/dynamic_bitset.hpp>
    #include <fstream>
    #include <iostream>
    #include <map>
    #include <string>
    #include <tuple>
    
    namespace x3 = boost::spirit::x3;
    namespace fusion = boost::fusion;
    
    struct vertex_property {
        double longitude;
        double latitude;
    };
    
    struct edge_property { };
    
    struct Reader {
        Reader() {
            g.m_vertices.reserve(1024);
        }
    
        bool read_edges(std::string fname) {
            // Lambda function that adds edge to graph
            auto add_edge = [this](auto& ctx){
                // Add edge from the parsed attribute
                vertex_decriptor source, target;
                auto tup = std::tie(source, target);
    
                fusion::copy(x3::_attr(ctx), tup);
    
                boost::add_edge(this->map_vertex(source), this->map_vertex(target), g);
            };
    
            // Parse the gowalla edge file
            boost::iostreams::mapped_file_source mm(fname);
    
            auto f = mm.begin(), l = mm.end();
    
            x3::parse(f, l, *((x3::int_ >> '\t' >> x3::int_ >> x3::eol)[add_edge]));
    
            // Fail if we couldn't parse the whole edges file
            return f == l;
        }
    
        bool read_locations(std::string fname) {
            boost::dynamic_bitset<> location_already_added(num_vertices(g));
    
            // Lambda function that adds locations to vertices in the graph
            auto add_location = [&](auto& ctx){
                // _attr(ctx) returns a boost fusion tuple
                auto const& attr = x3::_attr(ctx);
                auto vertex_id = fusion::at_c<0>(attr);
    
                if (location_already_added.test(vertex_id))
                    return true; // Exit, as we already stored the location for this vertex
                location_already_added.set(vertex_id);
    
                // Test if vertex is in our graph
                // We are parsing locations from a different file than the graph, so
                // there might be inconsistencies
                auto mapped = this->mapped_vertex(vertex_id);
                if (graph::null_vertex() == mapped) {
                    std::cerr << "Tried to add location to vertex " << vertex_id << ", but this vertex is not in our graph" << std::endl;
                    return false;
                }
    
                // Add location to the vertex
                auto& props = g[mapped];
                props.latitude  = fusion::at_c<1>(attr);
                props.longitude = fusion::at_c<2>(attr);
    
                return true;
            };
    
            // Parse the gowalla location file
            std::ifstream location_file(fname);
            if (!location_file) return false;
    
            boost::iostreams::mapped_file_source mm(fname);
    
            auto f = mm.begin(), l = mm.end();
    
            x3::parse(f, l,
                    // [vertex_id]   [time of checkin]       [latitude]  [longitude]             [location] id
                    *((x3::int_ >> '\t' >> x3::omit[*x3::graph] >> '\t' >> x3::double_ >> '\t' >> x3::double_)[add_location] >> '\t' >> x3::int_ >> x3::eol)
                    );
    
            // Fail if we couldn't parse the whole location file
            return f == l;
        }
    
        // We use setS to enforce our graph not to become a multigraph
        typedef boost::adjacency_list<boost::setS, boost::vecS, boost::undirectedS, vertex_property, edge_property> graph;
      private:
        using vertex_decriptor = graph::vertex_descriptor;
    
        graph g;
    
    #if USE_VERTEX_DESCRIPTOR_MAPPING
        std::map<int, vertex_decriptor> mapped_vertices;
    
        vertex_decriptor map_vertex(int id) {
            auto match = mapped_vertices.find(id);
    
            if (match != mapped_vertices.end())
                return match->second; // vertex already known
            else                      // Otherwise add vertex
                return mapped_vertices[id] = boost::add_vertex(g);
        };
    
        vertex_decriptor mapped_vertex(int id) const {
            auto mapped = mapped_vertices.find(id);
    
            return mapped == mapped_vertices.end()
                ? graph::null_vertex()
                : mapped->second;
        }
    #else
        static vertex_decriptor map_vertex(int id) { return id; }
        static vertex_decriptor mapped_vertex(int id) { return id; }
    #endif
    };
    
    int main() {
        Reader reader;
        if (!reader.read_edges("loc-gowalla_edges.txt"))
            std::cerr << "Couldn't parse whole edges file" << std::endl;
    
        if (!reader.read_locations("loc-gowalla_totalCheckins.txt"))
            std::cerr << "Couldn't parse whole location file" << std::endl;
    }