Tags: c++, graph, boost-graph

Why does a graph (about 10k to 100k vertices and edges) take a very long time to delete?


I'm writing a program that deals with a graph. I defined a graph in a loop and added vertices and edges to it (about 4k vertices and 40k edges). But when one round of the loop ends, it takes a very long time to start the next round.

typedef adjacency_list<listS, vecS, bidirectionalS, property<vertex_name_t, LineLabel>> LineGraph;
for (int i = 0; i < num_comp; i++) {
        if (num_vertices(subgraphs[i]) == 1) {
            continue;
        }
        cout << "Subgraph " << i + 1 << ":" << endl;
        cout << "has " << num_vertices(subgraphs[i]) << " vertices" << endl;
        LineGraph llineg;
        get_line_graph(subgraphs[i], llineg);//vertices and edges are added here
    }

When debugging, it seems that when a loop iteration ends, the program jumps into a file named scoped_ptr.hpp and runs

~scoped_ptr() BOOST_SP_NOEXCEPT
    {
#if defined(BOOST_SP_ENABLE_DEBUG_HOOKS)
        boost::sp_scalar_destructor_hook( px );
#endif
        boost::checked_delete( px );
    }

Then it goes to checked_delete.hpp and runs

template<class T> inline void checked_delete(T * x) BOOST_NOEXCEPT
{
    // intentionally complex - simplification causes regressions
    typedef char type_must_be_complete[ sizeof(T)? 1: -1 ];
    (void) sizeof(type_must_be_complete);
    delete x;
}

I guess there are too many edges in the graph, so it takes a long time to free them from memory.

What should I do to make it faster?

I made a small test for this question. The graph has 10617 vertices and 72172 edges.

#include <boost/graph/adjacency_list.hpp>
#include <boost/timer.hpp>
#include <fstream>
#include <iostream>
using namespace boost;
using namespace std;

int main() {
    typedef adjacency_list<setS, vecS, bidirectionalS> G;
    timer t;
    for (int j = 0; j < 2; j++) {
        cout << t.elapsed() << endl;
        int num_v = 10617;
        int num_e = 72172;
        G g;
        for (int i = 0; i < num_v; i++) {
            add_vertex(g);
        }
        ifstream infile("wordassociation-2011.txt"); // read edges from file
        int a, b;
        char c;
        for (int i = 0; i < num_e; ++i) {
            infile >> a >> c >> b;
            add_edge(a, b, g);
            if ((i + 1) % 5000 == 0) {
                cout << "reading..." << endl;
            }
        }
        cout << t.elapsed() << endl;
    }
}

The result shows that it takes about two minutes to start the next loop. The output was:

0
reading...          (14 of these while the edges are read)
0.981
122.192
reading...
reading...

The 0.981 is printed at the end of the first iteration and the 122.192 at the start of the second, so the roughly two minutes in between are spent tearing down g.
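
To double-check that the time really goes into destruction rather than into reading or inserting, the delete can be timed explicitly. A minimal sketch (not part of the test above), using std::chrono:

#include <boost/graph/adjacency_list.hpp>
#include <chrono>
#include <iostream>

int main() {
    using G = boost::adjacency_list<boost::setS, boost::vecS, boost::bidirectionalS>;

    auto g = new G;
    for (int i = 0; i < 10617; ++i)
        add_vertex(*g);
    // ... add the 72172 edges here, exactly as in the test above ...

    auto t0 = std::chrono::steady_clock::now();
    delete g; // only the destruction is measured
    auto t1 = std::chrono::steady_clock::now();
    std::cout << std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count()
              << "ms spent destroying the graph\n";
}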


Solution

  • So, I can't leave well enough alone :)

    I downloaded the datasets from http://w3.usf.edu/FreeAssociation/AppendixA/index.html (clearly that's your source, because the counts match up exactly).

    I wrote a parser that reads into a Row datatype:

    // part of speech
    enum POS { Noun, Verb, Adjective, Adverb, Pronoun, Preposition, Interjection, Conjunction, Other };
    
    using Atom = AtomPool::Atom;
    
    struct Association {
        Atom     cue, target;
        bool     normed_;
        unsigned _g;   // norm_group
        unsigned _p;   // partic_response
        double   fsg;  // forward_strength
        double   bsg;  // backward_strength
        double   msg;  // mediated_strength "2-step strength"
        double   osg;  // overlapping_strength
        unsigned _m;   // mediators
        unsigned mmia; // mediators_missing_in_action
        double   _o;   // overlapping_associates shared by cue and target
        unsigned omia; // overlapping_missing_in_action
    };
    
    struct WordAttributes {
        unsigned ss;  // set size
        unsigned fr;  // frequency
        double   con; // concreteness
        bool     h;   // is homograph?
        POS      ps;  // part of speech
        double   mc;  // mean connectivity among its associates
        double   pr;  // probability of a resonant connection
        double   rsg; // resonant strength
        unsigned uc;  // use code
    };
    
    struct Row {
        Association    assoc;
        WordAttributes qattr, tattr;
    };
    

    I optimized a little by using string atoms because strings might be reused a lot. And I was right (see statistics printed by the demos below).
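
    As a quick illustration of what the atomizing buys, a usage fragment (assuming only the AtomPool interface shown in the listing below): identical strings are stored once and resolve to the same Atom id.

    AtomPool pool;
    auto earth1 = pool.atomize("EARTH");
    auto earth2 = pool.atomize("EARTH"); // deduplicated: same Atom as earth1
    auto moon   = pool.atomize("MOON");
    assert(earth1 == earth2 && earth1 != moon);
    assert(pool.lookup(earth1) == "EARTH"); // a string_view into the pool's backing buffer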

    Just The Parser

    Live On Coliru

    // #define BOOST_SPIRIT_X3_DEBUG
    #include <boost/container/flat_set.hpp>
    #include <boost/iostreams/device/mapped_file.hpp>
    #include <boost/spirit/home/x3.hpp>
    
    #include <iomanip>
    #include <iostream>
    
    #include <chrono>
    #include <iomanip>
    #include <iostream>
    #include <utility>
    
    namespace {
        using boost::container::flat_set;
    
        static long elapsed() {
            auto now = std::chrono::high_resolution_clock::now;
            using namespace std::chrono_literals;
            static auto start = now();
            auto        t     = now();
    
            return (t - std::exchange(start, t)) / 1ms;
        }
    
        void trace(auto const&... args) {
            ((std::cout << std::setw(10) << elapsed() << "ms ") << ... << args) << std::endl;
        }
    } // namespace
    
    struct AtomPool {
        using Atom = std::uint32_t; // Backing::size_type;
        static constexpr Atom max() { return std::numeric_limits<Atom>::max(); }
    
        size_t size() const { return _index.size(); }
        size_t size_bytes() const { return _backing.size(); }
    
        AtomPool() {
            _backing.reserve(100'000);
            _index.reserve(12'000);
        }
    
        ~AtomPool() {
            _index.clear();
            _backing.clear();
            if (_insertions)
                trace("~AtomPool deduplicated ", (_dedup * 100.0 / _insertions), "% of ", _insertions,
                      " insertions");
        }
    
      private:
        using Backing = std::vector<char>;
    
        struct Cmp {
            AtomPool const& _pool;
            using is_transparent = void;
    
            template <typename... T> bool operator()(T const&... v) const { return (view(v) < ...); }
    
          private:
            std::string_view view(Atom id) const { return _pool.lookup(id); }
            std::string_view view(std::string_view sv) const { return sv; }
        };
    
        Backing             _backing;
        flat_set<Atom, Cmp> _index{Cmp{*this}};
        size_t              _insertions = 0, _dedup = 0;
    
      public:
        Atom atomize(std::string const& text) {
            if (_backing.size() > max())
                throw std::range_error("AtomPool");
    
            _insertions += 1;
    
            if (auto it = _index.find(text); it != _index.end()) {
                _dedup += 1;
                return *it;
            }
    
            auto pos = _backing.size();
            _backing.insert(_backing.end(), begin(text), end(text));
            _backing.push_back('\0');
    
            _index.insert(pos);
            return pos;
        }
    
        std::string_view lookup(Atom id) const {
            assert(id < _backing.size());
            return _backing.data() + id;
        }
    };
    
    namespace DataSet {
    
        /*
         * |--------------+-----------------------------------------------------|
         * | Abbreviation | Equivalence                                         |
         * |==============+=====================================================|
         * | CUE          | Normed Word                                         |
         * | TARGET       | Response to Normed Word                             |
         * | NORMED?      | Is Response Normed?                                 |
         * | #G           | Group size                                          |
         * | #            | P  Number of Participants Producing Response        |
         * | FSG          | Forward Cue-to-Target Strength                      |
         * | BSG          | Backward Target-to-Cue Strength                     |
         * | MSG          | Mediated Strength                                   |
         * | OSG          | Overlapping Associate Strength                      |
         * | #            | M  Number of Mediators                              |
         * | MMIA         | Number of Non-Normed Potential Mediating Associates |             
         * | #            | O  Number of Overlapping Associates                 |
         * | OMIA         | Number of Non-Normed Overlapping Associates         |
         * | QSS          | Cue: Set Size                                       |
         * | QFR          | Cue: Frequency                                      |
         * | QCON         | Cue: Concreteness                                   |
         * | QH           | Cue is a Homograph?                                 |
         * | QPS          | Cue: Part of Speech                                 |
         * | QMC          | Cue: Mean Connectivity Among Its Associates         |
         * | QPR          | Cue: Probability of a Resonant Connection           |
         * | QRSG         | Cue: Resonant Strength                              |
         * | QUC          | Cue: Use Code                                       |
         * | TSS          | Target: Set Size                                    |
         * | TFR          | Target: Frequency                                   |
         * | TCON         | Target: Concreteness                                |
         * | TH           | Target is a Homograph?                              |
         * | TPS          | Target: Part of Speech                              |
         * | TMC          | Target: Mean Connectivity Among Its Assoc.          |
         * | TPR          | Target: Probability of a Resonant Connection        |
         * | TRSG         | Target: Resonant Strength                           |
         * | TUC          | Target: Use Code                                    |
         * |--------------+-----------------------------------------------------|
         */
        enum ColId {
            CUE, TARGET, NORMED_, _G, _P, FSG, BSG, MSG, OSG, _M, MMIAS, _O, OMIAS, QSS, //
            QFR, QCON, QH, QPS, QMC, QPR, QRSG, QUC, TSS, TFR, TCON, TH, TPS, TMC, TPR, //
            TRSG, TUC, //
            NUMCOLUMNS
        };
    
        // part of speech
        enum POS { Noun, Verb, Adjective, Adverb, Pronoun, Preposition, Interjection, Conjunction, Other };
    
        using Atom = AtomPool::Atom;
    
        struct Association {
            Atom     cue, target;
            bool     normed_;
            unsigned _g;   // norm_group
            unsigned _p;   // partic_response
            double   fsg;  // forward_strength
            double   bsg;  // backward_strength
            double   msg;  // mediated_strength "2-step strength"
            double   osg;  // overlapping_strength
            unsigned _m;   // mediators
            unsigned mmia; // mediators_missing_in_action
            double   _o;   // overlapping_associates shared by cue and target
            unsigned omia; // overlapping_missing_in_action
        };
    
        struct WordAttributes {
            unsigned ss;  // set size
            unsigned fr;  // frequency
            double   con; // concreteness
            bool     h;   // is homograph?
            POS      ps;  // part of speech
            double   mc;  // mean connectivity among its associates
        double   pr;  // probability of a resonant connection
            double   rsg; // resonant strength
            unsigned uc;  // use code
        };
    
        struct Row {
            Association    assoc;
            WordAttributes qattr, tattr;
        };
    } // namespace DataSet
      //
    #include <boost/fusion/adapted.hpp>
    
    BOOST_FUSION_ADAPT_STRUCT(DataSet::WordAttributes, ss, fr, con, h, ps, mc, pr, rsg, uc)
    BOOST_FUSION_ADAPT_STRUCT(DataSet::Association, cue, target, normed_, _g, _p, fsg, bsg, msg, osg, _m, mmia, _o, omia)
    BOOST_FUSION_ADAPT_STRUCT(DataSet::Row, assoc, qattr, tattr)
    
    struct Reader {
        using Atom        = DataSet::Atom;
        using Association = DataSet::Association;
        using Attributes  = DataSet::WordAttributes;
        using POS         = DataSet::POS;
    
        using Row  = DataSet::Row;
        using File = std::vector<Row>;
    
        Reader(std::string const& fname, AtomPool& pool) : _atoms(pool), _mfs{fname} { parse(); }
    
      private:
        AtomPool&                            _atoms;
        boost::iostreams::mapped_file_source _mfs;
        File                                 _rows;
    
        using It = char const*;
    
        void parse() try {
            It const f = _mfs.begin(), l = _mfs.end();
    
            long const total_bytes = l - f;
            _rows.reserve(total_bytes / 150);
            namespace x3  = boost::spirit::x3;
            namespace enc = x3::standard; // iso8859_1;
            using enc::blank;
            using enc::char_;
            using enc::lit;
            using enc::space;
            // using enc::symbols;
    
            auto atom = [this](auto& ctx) {
                auto& vec = _attr(ctx);
                _val(ctx) = _atoms.atomize(std::string(vec.begin(), vec.end()));
            };
    
            auto normed = x3::symbols<bool>{}.add("NO", false)("YES", true).sym;
            normed.name("YES|NO");
    
            auto col = x3::rule<Atom, Atom>{"col"} = x3::lexeme[*~char_(",\r\n")][atom];
    
            auto header = x3::eps   //
                >> "CUE" >> ','     //
                >> "TARGET" >> ','  //
                >> "NORMED?" >> ',' //
                >> "#G" >> ','      //
                >> "#P" >> ','      //
                >> "FSG" >> ','     //
                >> "BSG" >> ','     //
                >> "MSG" >> ','     //
                >> "OSG" >> ','     //
                >> "#M" >> ','      //
                >> "MMIAS" >> ','   //
                >> "#O" >> ','      //
                >> "OMIAS" >> ','   //
                >> "QSS" >> ','     //
                >> "QFR" >> ','     //
                >> "QCON" >> ','    //
                >> "QH" >> ','      //
                >> "QPS" >> ','     //
                >> "QMC" >> ','     //
                >> "QPR" >> ','     //
                >> "QRSG" >> ','    //
                >> "QUC" >> ','     //
                >> "TSS" >> ','     //
                >> "TFR" >> ','     //
                >> "TCON" >> ','    //
                >> "TH" >> ','      //
                >> "TPS" >> ','     //
                >> "TMC" >> ','     //
                >> "TPR" >> ','     //
                >> "TRSG" >> ','    //
                >> "TUC" >> x3::eol;
    
            auto yen = lit('\xa5') | x3::iso8859_1::lit("¥");
    
            auto eoc = &(',' | x3::eol | x3::eoi); // end-of-column predicate
    
            auto strength            //
                = x3::double_ >> eoc //
                | -yen >> x3::attr(NAN);
    
            auto count                                    //
                = x3::uint_ >> &(',' | x3::eol | x3::eoi) //
                | -yen >> x3::attr(-1u);
    
            auto homograph = x3::matches[enc::graph - ','];
            auto part_of_speech = //
                x3::symbols<POS>{}
                    .add("N", POS::Noun) //
                ("V", POS::Verb)         //
                ("AJ", POS::Adjective)   //
                ("AD", POS::Adverb)      //
                ("P", POS::Pronoun)      //
                ("PP", POS::Preposition) //
                ("I", POS::Interjection) //
                ("C", POS::Conjunction)  //
                // these were undocumented but deduced from context
                ("INT", POS::Interjection) //
                ("A", POS::Adverb)         //
                ("AV", POS::Adverb)        //
                ("ADV", POS::Adverb)       //
                ("ADJ", POS::Adjective)    //
                ("PRP", POS::Pronoun)      // "ACCOUNT" probably mistagged in dataset
                    .sym |
                eoc >> x3::attr(POS::Other);
    
            auto attributes = x3::rule<Attributes, Attributes>{"attributes"} = x3::eps
    
                > count > ','          // SS
                > count > ','          // FR
                > strength > ','       // CON
                > homograph > ','      // H
                > part_of_speech > ',' // PS
                > strength > ','       // M
                > strength > ','       // PR
                > strength > ','       // RSG
                > count;               // UC
    
            auto association = x3::rule<Association, Association>{"association"} = (*header >> !x3::eoi)
    
                > col > ','      // CUE
                > col > ','      // TARGET
                > normed > ','   // NORMED?
                > count > ','    // #G
                > count > ','    // #P
                > strength > ',' // FSG
                > strength > ',' // BSG
                > strength > ',' // MSG
                > strength > ',' // OSG
                > count > ','    // #M
                > count > ','    // MMIAS
                > strength > ',' // #O
                > count;         // OMIAS
    
            auto row = x3::rule<Row, Row>{"row"} = //
                (*header >> !x3::eoi)              //
                > association > ','                // assoc
                > attributes > ','                 // qattr
                > attributes;                      // tattr
    
            auto file = x3::rule<File, File>{"file"} = *(row >> x3::eol);
    
            phrase_parse(f, l, x3::eps > file > x3::eoi, blank, _rows);
    
            using DataSet::NUMCOLUMNS;
            trace("Parsed: ", total_bytes / 1024.0 / 1024, "MiB ", "into ", _rows.size(), " rows");
    
            trace(_atoms.size(), " atoms totaling ", _atoms.size_bytes(), " bytes of backing storage");
        } catch (boost::spirit::x3::expectation_failure<It> const& ef) {
            It   b = _mfs.begin(), e = _mfs.end();
            auto w = ef.where();
            auto sol = b + std::string_view(b, w).find_last_of("\r\n") + 1;
    
            auto line = std::string_view(sol, e);
            line      = line.substr(0, line.find_first_of("\r\n"));
    
            auto l = 1 + std::count(b, w, '\n'), c = 1 + (w - sol);
            trace("Expected ", ef.which(), " at L:", l, ":", c, " at\n", //
                  "    ", line, "\n",                                    //
                  std::setw(c + 4), '^', "----- here");                  //
            std::exit(1);
        }
    };
    
    int main(int argc, char** argv) {
        AtomPool pool;
    
        trace("Start");
        {
            for (std::string fname : std::vector(argv+1, argv+argc))
                Reader r{fname, pool};
        }
        trace("Done");
    }
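
    A typical local build line might be something like this (a guess at the exact invocation; the code uses C++20 features and mapped_file_source needs the compiled Boost.Iostreams library):

    g++ -std=c++20 -O2 main.cpp -lboost_iostreams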
    

    Which prints, for the small demo dataset:

    ./a.out /Archive2/bd/de4002f5d4d3ee/main.cpp
         0ms Start
        54ms Parsed: 0.00769901MiB into 40 rows
         0ms 70 atoms totaling 462 bytes of backing storage
         0ms Done
         0ms ~AtomPool deduplicated 12.5% of 80 insertions
    

    Locally, the entire dataset is parsed in ~90ms:

    /build/sotest dataset/Cue_Target_Pairs.* dataset/full.txt 
    
         0ms Start
        14ms Parsed: 1.15371MiB into 8476 rows
         0ms 3732 atoms totaling 26643 bytes of backing storage
        13ms Parsed: 1.03647MiB into 7590 rows
         0ms 5340 atoms totaling 39238 bytes of backing storage
        19ms Parsed: 1.42218MiB into 10500 rows
         0ms 6901 atoms totaling 51720 bytes of backing storage
        15ms Parsed: 1.14505MiB into 8440 rows
         0ms 7917 atoms totaling 59821 bytes of backing storage
        14ms Parsed: 1.26198MiB into 9287 rows
         0ms 8709 atoms totaling 66346 bytes of backing storage
        15ms Parsed: 1.35513MiB into 9888 rows
         0ms 9456 atoms totaling 72769 bytes of backing storage
        14ms Parsed: 1.25216MiB into 9214 rows
         0ms 10055 atoms totaling 77640 bytes of backing storage
        15ms Parsed: 1.19052MiB into 8781 rows
         0ms 10617 atoms totaling 82271 bytes of backing storage
        86ms Parsed: 9.8172MiB into 72176 rows
         0ms 10617 atoms totaling 82271 bytes of backing storage
         0ms Done
         0ms ~AtomPool deduplicated 96.3225% of 288704 insertions
    

    Creating The Graph

    Let's create all (normed and unnormed) words as vertices and add all associations (input rows) as edges.

    #include <boost/graph/adjacency_list.hpp>
    using VertexProp = DataSet::WordAttributes;
    using EdgeProp   = DataSet::Association;
    using Graph      = boost::adjacency_list<boost::vecS, boost::vecS, boost::directedS, VertexProp, EdgeProp>;
    

    Implementing buildGraph can be pretty straightforward using a temporary mapping from atom to vertex descriptor:

    Graph buildGraph() const {
        using V = Graph::vertex_descriptor;
    
        Graph g;
        g.m_vertices.reserve(_atoms.size());
    
        std::map<Atom, V> vmap; // not flat-map for stability
    
        for (auto& [assoc, qattr, tattr] : _rows) {
            auto s = vmap.find(assoc.cue);
            auto t = vmap.find(assoc.target);
            if (s == vmap.end()) s = vmap.emplace(assoc.cue, add_vertex(qattr, g)).first;
            if (t == vmap.end()) t = vmap.emplace(assoc.target, add_vertex(tattr, g)).first;
    
            assert(g[s->second] == qattr); // check consistent input for qattr/tattr
            assert(g[t->second] == tattr);
    
            add_edge(s->second, t->second, assoc, g);
        }
    
        return g;
    }
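
    Once built, the graph can be queried through the bundled properties. For instance, a sketch (assuming the Graph, AtomPool and DataSet definitions above, plus <cmath> and <iostream>) that prints the strongest forward association:

    void printStrongest(Graph const& g, AtomPool const& pool) {
        double                      best = -1;
        DataSet::Association const* hit  = nullptr;
        for (auto [it, end] = edges(g); it != end; ++it) {
            auto const& a = g[*it]; // bundled EdgeProp, i.e. DataSet::Association
            if (!std::isnan(a.fsg) && a.fsg > best) {
                best = a.fsg;
                hit  = &a;
            }
        }
        if (hit)
            std::cout << pool.lookup(hit->cue) << " -> " << pool.lookup(hit->target)
                      << " (FSG " << hit->fsg << ")\n";
    }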
    

    In our main let's build the graphs after parsing:

    int main(int argc, char** argv) {
        AtomPool pool;
    
        trace("Start");
        for (std::string fname : std::vector(argv + 1, argv + argc)) {
            Reader r{fname, pool};
            trace("Parsed ", fname);
            {
                Graph g = r.buildGraph();
                trace("Graph creation done; ", num_vertices(g), " vertices and ", num_edges(g), " edges");
            }
            trace("Graph destruction done");
        }
        trace("Done");
    }
    

    Live On Coliru

    ./a.out /Archive2/bd/de4002f5d4d3ee/main.cpp
         0ms Start
       322ms Parsed: 0.00769901MiB into 40 rows
         0ms Parsed /Archive2/bd/de4002f5d4d3ee/main.cpp
         0ms Graph creation done; 70 vertices and 40 edges
         0ms Graph destruction done
         0ms Done
         0ms ~AtomPool deduplicated 12.5% of 80 insertions
    

    For the full dataset, the output becomes:

         0ms Start
        93ms Parsed: 9.8172MiB into 72176 rows
         0ms Parsed dataset/full.txt
        16ms Graph creation done; 10617 vertices and 72176 edges
         1ms Graph destruction done
         0ms Done
         0ms ~AtomPool deduplicated 92.6451% of 144352 insertions
    

    NOTE

    There are four more edges than you reported in your question. Perhaps you haven't added the undocumented part-of-speech codes like I did?


    As you can see, virtually no time is spent in destruction. And that is while copying all the data out of the Reader, so that the Reader can be disposed of after the graph has been created and the vertex/edge properties remain mutable.

    If you don't need that, you can make everything more memory-efficient using pointers instead.
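
    A sketch of that pointer-based variant (assuming the DataSet types above; the Reader must then outlive the graph, and buildGraph would store the addresses of its rows instead of copies):

    using VertexPropPtr = DataSet::WordAttributes const*;
    using EdgePropPtr   = DataSet::Association const*;
    using LightGraph    = boost::adjacency_list<boost::vecS, boost::vecS, boost::directedS,
                                                VertexPropPtr, EdgePropPtr>;
    // in a pointer-based buildGraph: add_vertex(&qattr, g) and add_edge(s, t, &assoc, g)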

    listS EdgeContainerSelector?

    I don't see a significant difference, though graph destruction now does indeed take 1ms longer:

         0ms Start
        90ms Parsed: 9.8172MiB into 72176 rows
         0ms Parsed dataset/full.txt
        13ms Graph creation done; 10617 vertices and 72176 edges
         2ms Graph destruction done
         0ms Done
         0ms ~AtomPool deduplicated 92.6451% of 144352 insertions
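
    Presumably the only change for this variant is the first template argument, which selects the out-edge container (a sketch mirroring the question's own choice of listS):

    using Graph = boost::adjacency_list<boost::listS, boost::vecS, boost::directedS, VertexProp, EdgeProp>;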
    

    bidirectionalS?

    This doesn't seem to make a ton of sense in the application domain¹. It also has significant storage and insertion overhead. However, under some access patterns that overhead might pay off in improved query performance. YMMV.
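
    A sketch of this bidirectional variant and of the incoming-edge access that would justify its overhead (assuming the VertexProp/EdgeProp and AtomPool definitions above; printCuesFor is a hypothetical helper):

    using BidiGraph = boost::adjacency_list<boost::vecS, boost::vecS, boost::bidirectionalS,
                                            VertexProp, EdgeProp>;

    // bidirectionalS gives constant-time access to incoming edges, e.g. every cue
    // whose association points at a given target vertex v:
    void printCuesFor(BidiGraph::vertex_descriptor v, BidiGraph const& g, AtomPool const& pool) {
        for (auto [it, end] = in_edges(v, g); it != end; ++it)
            std::cout << pool.lookup(g[*it].cue) << "\n";
    }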

    Compare Live or locally against the full data set:

         0ms Start
        90ms Parsed: 9.8172MiB into 72176 rows
         0ms Parsed dataset/full.txt
        17ms Graph creation done; 10617 vertices and 72176 edges
         6ms Graph destruction done
         0ms Done
         0ms ~AtomPool deduplicated 92.6451% of 144352 insertions
    

    As predicted, the added inefficiency does show up in the timings. Relatively it's quite significant, although 6x slower destruction is still only 6ms.

    ¹ The correlation between forward and backward strength for cues whose targets have been normed is positive but not high, r = .29 (n = 63,619), and the chances of correctly guessing back strengths from knowledge of forward strengths are low.