Tags: c++, graph, boost-graph

Why does a graph (about 10k to 100k vertices and edges) take a very long time to delete?


I'm writing a program that deals with a graph. I defined a graph in a loop and added vertices and edges to it (about 4k vertices and 40k edges). But when one round of the loop ends, it takes a very long time to start the next round.

typedef adjacency_list<listS, vecS, bidirectionalS, property<vertex_name_t, LineLabel>> LineGraph;
for (int i = 0; i < num_comp; i++) {
        if (num_vertices(subgraphs[i]) == 1) {
            continue;
        }
        cout << "Subgraph " << i + 1 << ":" << endl;
        cout << "has " << num_vertices(subgraphs[i]) << " vertices" << endl;
        LineGraph llineg;
        get_line_graph(subgraphs[i], llineg);//vertices and edges are added here
    }

When debugging, it seems that when a loop iteration ends, the program jumps into a file named scoped_ptr.hpp and runs

~scoped_ptr() BOOST_SP_NOEXCEPT
    {
#if defined(BOOST_SP_ENABLE_DEBUG_HOOKS)
        boost::sp_scalar_destructor_hook( px );
#endif
        boost::checked_delete( px );
    }

Then it goes to checked_delete.hpp and runs

template<class T> inline void checked_delete(T * x) BOOST_NOEXCEPT
{
    // intentionally complex - simplification causes regressions
    typedef char type_must_be_complete[ sizeof(T)? 1: -1 ];
    (void) sizeof(type_must_be_complete);
    delete x;
}

I guess there are too many edges in the graph, so it takes a long time to free them from memory.

What should I do to make it faster?

I made a small test for this question. The graph has 10617 vertices and 72172 edges.

#include <boost/graph/adjacency_list.hpp>
#include <boost/timer.hpp>
#include <fstream>
#include <iostream>
using namespace boost;
using namespace std;

int main() {
    typedef adjacency_list<setS, vecS, bidirectionalS> G;
    timer t;
    for (int j = 0; j < 2; j++) {
        cout << t.elapsed() << endl;
        int num_v = 10617;
        int num_e = 72172;
        G g;
        for (int i = 0; i < num_v; i++) {
            add_vertex(g);
        }
        ifstream infile("wordassociation-2011.txt"); // read edges from file
        int a, b;
        char c;
        for (int i = 0; i < num_e; ++i) {
            infile >> a >> c >> b;
            add_edge(a, b, g);
            if ((i + 1) % 5000 == 0) {
                cout << "reading..." << endl;
            }
        }
        cout << t.elapsed() << endl;
    }
}

The result shows that it takes about two minutes to start the next loop. The output was:

0
reading...          (14 of these while the edges are read)
0.981
122.192
reading...
reading...

The 0.981 is printed at the end of the first iteration and the 122.192 at the start of the second, so the roughly two minutes in between are spent tearing down g.
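
To double-check that the time really goes into destruction rather than into reading or inserting, the delete can be timed explicitly. A minimal sketch (not part of the test above), using std::chrono:

#include <boost/graph/adjacency_list.hpp>
#include <chrono>
#include <iostream>

int main() {
    using G = boost::adjacency_list<boost::setS, boost::vecS, boost::bidirectionalS>;

    auto g = new G;
    for (int i = 0; i < 10617; ++i)
        add_vertex(*g);
    // ... add the 72172 edges here, exactly as in the test above ...

    auto t0 = std::chrono::steady_clock::now();
    delete g; // only the destruction is measured
    auto t1 = std::chrono::steady_clock::now();
    std::cout << std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count()
              << "ms spent destroying the graph\n";
}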


Solution

  • So, I can't leave well enough alone :)

    I downloaded the datasets from http://w3.usf.edu/FreeAssociation/AppendixA/index.html (clearly that's your source, because the counts match up exactly).

    I wrote a parser that reads into a Row datatype:

    // part of speech
    enum POS { Noun, Verb, Adjective, Adverb, Pronoun, Preposition, Interjection, Conjunction, Other };
    
    using Atom = AtomPool::Atom;
    
    struct Association {
        Atom     cue, target;
        bool     normed_;
        unsigned _g;   // norm_group
        unsigned _p;   // partic_response
        double   fsg;  // forward_strength
        double   bsg;  // backward_strength
        double   msg;  // mediated_strength "2-step strength"
        double   osg;  // overlapping_strength
        unsigned _m;   // mediators
        unsigned mmia; // mediators_missing_in_action
        double   _o;   // overlapping_associates shared by cue and target
        unsigned omia; // overlapping_missing_in_action
    };
    
    struct WordAttributes {
        unsigned ss;  // set size
        unsigned fr;  // frequency
        double   con; // concreteness
        bool     h;   // is homograph?
        POS      ps;  // part of speech
        double   mc;  // mean connectivity among its associates
        double   pr;  // probability of a resonant connection
        double   rsg; // resonant strength
        unsigned uc;  // use code
    };
    
    struct Row {
        Association    assoc;
        WordAttributes qattr, tattr;
    };
    

    I optimized a little by using string atoms because strings might be reused a lot. And I was right (see statistics printed by the demos below).
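
    As a quick illustration of what the atomizing buys, a usage fragment (assuming only the AtomPool interface shown in the listing below): identical strings are stored once and resolve to the same Atom id.

    AtomPool pool;
    auto earth1 = pool.atomize("EARTH");
    auto earth2 = pool.atomize("EARTH"); // deduplicated: same Atom as earth1
    auto moon   = pool.atomize("MOON");
    assert(earth1 == earth2 && earth1 != moon);
    assert(pool.lookup(earth1) == "EARTH"); // a string_view into the pool's backing buffer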

    Just The Parser

    Live On Coliru

    // #define BOOST_SPIRIT_X3_DEBUG
    #include <boost/container/flat_set.hpp>
    #include <boost/iostreams/device/mapped_file.hpp>
    #include <boost/spirit/home/x3.hpp>
    
    #include <iomanip>
    #include <iostream>
    
    #include <chrono>
    #include <iomanip>
    #include <iostream>
    #include <utility>
    
    namespace {
        using boost::container::flat_set;
    
        static long elapsed() {
            auto now = std::chrono::high_resolution_clock::now;
            using namespace std::chrono_literals;
            static auto start = now();
            auto        t     = now();
    
            return (t - std::exchange(start, t)) / 1ms;
        }
    
        void trace(auto const&... args) {
            ((std::cout << std::setw(10) << elapsed() << "ms ") << ... << args) << std::endl;
        }
    } // namespace
    
    struct AtomPool {
        using Atom = std::uint32_t; // Backing::size_type;
        static constexpr Atom max() { return std::numeric_limits<Atom>::max(); }
    
        size_t size() const { return _index.size(); }
        size_t size_bytes() const { return _backing.size(); }
    
        AtomPool() {
            _backing.reserve(100'000);
            _index.reserve(12'000);
        }
    
        ~AtomPool() {
            _index.clear();
            _backing.clear();
            if (_insertions)
                trace("~AtomPool deduplicated ", (_dedup * 100.0 / _insertions), "% of ", _insertions,
                      " insertions");
        }
    
      private:
        using Backing = std::vector<char>;
    
        struct Cmp {
            AtomPool const& _pool;
            using is_transparent = void;
    
            template <typename... T> bool operator()(T const&... v) const { return (view(v) < ...); }
    
          private:
            std::string_view view(Atom id) const { return _pool.lookup(id); }
            std::string_view view(std::string_view sv) const { return sv; }
        };
    
        Backing             _backing;
        flat_set<Atom, Cmp> _index{Cmp{*this}};
        size_t              _insertions = 0, _dedup = 0;
    
      public:
        Atom atomize(std::string const& text) {
            if (_backing.size() > max())
                throw std::range_error("AtomPool");
    
            _insertions += 1;
    
            if (auto it = _index.find(text); it != _index.end()) {
                _dedup += 1;
                return *it;
            }
    
            auto pos = _backing.size();
            _backing.insert(_backing.end(), begin(text), end(text));
            _backing.push_back('\0');
    
            _index.insert(pos);
            return pos;
        }
    
        std::string_view lookup(Atom id) const {
            assert(id < _backing.size());
            return _backing.data() + id;
        }
    };
    
    namespace DataSet {
    
        /*
         * |--------------+-----------------------------------------------------|
         * | Abbreviation | Equivalence                                         |
         * |==============+=====================================================|
         * | CUE          | Normed Word                                         |
         * | TARGET       | Response to Normed Word                             |
         * | NORMED?      | Is Response Normed?                                 |
         * | #G           | Group size                                          |
         * | #            | P  Number of Participants Producing Response        |
         * | FSG          | Forward Cue-to-Target Strength                      |
         * | BSG          | Backward Target-to-Cue Strength                     |
         * | MSG          | Mediated Strength                                   |
         * | OSG          | Overlapping Associate Strength                      |
         * | #            | M  Number of Mediators                              |
         * | MMIA         | Number of Non-Normed Potential Mediating Associates |             
         * | #            | O  Number of Overlapping Associates                 |
         * | OMIA         | Number of Non-Normed Overlapping Associates         |
         * | QSS          | Cue: Set Size                                       |
         * | QFR          | Cue: Frequency                                      |
         * | QCON         | Cue: Concreteness                                   |
         * | QH           | Cue is a Homograph?                                 |
         * | QPS          | Cue: Part of Speech                                 |
         * | QMC          | Cue: Mean Connectivity Among Its Associates         |
         * | QPR          | Cue: Probability of a Resonant Connection           |
         * | QRSG         | Cue: Resonant Strength                              |
         * | QUC          | Cue: Use Code                                       |
         * | TSS          | Target: Set Size                                    |
         * | TFR          | Target: Frequency                                   |
         * | TCON         | Target: Concreteness                                |
         * | TH           | Target is a Homograph?                              |
         * | TPS          | Target: Part of Speech                              |
         * | TMC          | Target: Mean Connectivity Among Its Assoc.          |
         * | TPR          | Target: Probability of a Resonant Connection        |
         * | TRSG         | Target: Resonant Strength                           |
         * | TUC          | Target: Use Code                                    |
         * |--------------+-----------------------------------------------------|
         */
        enum ColId {
            CUE, TARGET, NORMED_, _G, _P, FSG, BSG, MSG, OSG, _M, MMIAS, _O, OMIAS, QSS, //
            QFR, QCON, QH, QPS, QMC, QPR, QRSG, QUC, TSS, TFR, TCON, TH, TPS, TMC, TPR, //
            TRSG, TUC, //
            NUMCOLUMNS
        };
    
        // part of speech
        enum POS { Noun, Verb, Adjective, Adverb, Pronoun, Preposition, Interjection, Conjunction, Other };
    
        using Atom = AtomPool::Atom;
    
        struct Association {
            Atom     cue, target;
            bool     normed_;
            unsigned _g;   // norm_group
            unsigned _p;   // partic_response
            double   fsg;  // forward_strength
            double   bsg;  // backward_strength
            double   msg;  // mediated_strength "2-step strength"
            double   osg;  // overlapping_strength
            unsigned _m;   // mediators
            unsigned mmia; // mediators_missing_in_action
            double   _o;   // overlapping_associates shared by cue and target
            unsigned omia; // overlapping_missing_in_action
        };
    
        struct WordAttributes {
            unsigned ss;  // set size
            unsigned fr;  // frequency
            double   con; // concreteness
            bool     h;   // is homograph?
            POS      ps;  // part of speech
            double   mc;  // mean connectivity among its associates
        double   pr;  // probability of a resonant connection
            double   rsg; // resonant strength
            unsigned uc;  // use code
        };
    
        struct Row {
            Association    assoc;
            WordAttributes qattr, tattr;
        };
    } // namespace DataSet
      //
    #include <boost/fusion/adapted.hpp>
    
    BOOST_FUSION_ADAPT_STRUCT(DataSet::WordAttributes, ss, fr, con, h, ps, mc, pr, rsg, uc)
    BOOST_FUSION_ADAPT_STRUCT(DataSet::Association, cue, target, normed_, _g, _p, fsg, bsg, msg, osg, _m, mmia, _o, omia)
    BOOST_FUSION_ADAPT_STRUCT(DataSet::Row, assoc, qattr, tattr)
    
    struct Reader {
        using Atom        = DataSet::Atom;
        using Association = DataSet::Association;
        using Attributes  = DataSet::WordAttributes;
        using POS         = DataSet::POS;
    
        using Row  = DataSet::Row;
        using File = std::vector<Row>;
    
        Reader(std::string const& fname, AtomPool& pool) : _atoms(pool), _mfs{fname} { parse(); }
    
      private:
        AtomPool&                            _atoms;
        boost::iostreams::mapped_file_source _mfs;
        File                                 _rows;
    
        using It = char const*;
    
        void parse() try {
            It const f = _mfs.begin(), l = _mfs.end();
    
            long const total_bytes = l - f;
            _rows.reserve(total_bytes / 150);
            namespace x3  = boost::spirit::x3;
            namespace enc = x3::standard; // iso8859_1;
            using enc::blank;
            using enc::char_;
            using enc::lit;
            using enc::space;
            // using enc::symbols;
    
            auto atom = [this](auto& ctx) {
                auto& vec = _attr(ctx);
                _val(ctx) = _atoms.atomize(std::string(vec.begin(), vec.end()));
            };
    
            auto normed = x3::symbols<bool>{}.add("NO", false)("YES", true).sym;
            normed.name("YES|NO");
    
            auto col = x3::rule<Atom, Atom>{"col"} = x3::lexeme[*~char_(",\r\n")][atom];
    
            auto header = x3::eps   //
                >> "CUE" >> ','     //
                >> "TARGET" >> ','  //
                >> "NORMED?" >> ',' //
                >> "#G" >> ','      //
                >> "#P" >> ','      //
                >> "FSG" >> ','     //
                >> "BSG" >> ','     //
                >> "MSG" >> ','     //
                >> "OSG" >> ','     //
                >> "#M" >> ','      //
                >> "MMIAS" >> ','   //
                >> "#O" >> ','      //
                >> "OMIAS" >> ','   //
                >> "QSS" >> ','     //
                >> "QFR" >> ','     //
                >> "QCON" >> ','    //
                >> "QH" >> ','      //
                >> "QPS" >> ','     //
                >> "QMC" >> ','     //
                >> "QPR" >> ','     //
                >> "QRSG" >> ','    //
                >> "QUC" >> ','     //
                >> "TSS" >> ','     //
                >> "TFR" >> ','     //
                >> "TCON" >> ','    //
                >> "TH" >> ','      //
                >> "TPS" >> ','     //
                >> "TMC" >> ','     //
                >> "TPR" >> ','     //
                >> "TRSG" >> ','    //
                >> "TUC" >> x3::eol;
    
            auto yen = lit('\xa5') | x3::iso8859_1::lit("¥");
    
            auto eoc = &(',' | x3::eol | x3::eoi); // end-of-column predicate
    
            auto strength            //
                = x3::double_ >> eoc //
                | -yen >> x3::attr(NAN);
    
            auto count                                    //
                = x3::uint_ >> &(',' | x3::eol | x3::eoi) //
                | -yen >> x3::attr(-1u);
    
            auto homograph = x3::matches[enc::graph - ','];
            auto part_of_speech = //
                x3::symbols<POS>{}
                    .add("N", POS::Noun) //
                ("V", POS::Verb)         //
                ("AJ", POS::Adjective)   //
                ("AD", POS::Adverb)      //
                ("P", POS::Pronoun)      //
                ("PP", POS::Preposition) //
                ("I", POS::Interjection) //
                ("C", POS::Conjunction)  //
                // these were undocumented but deduced from context
                ("INT", POS::Interjection) //
                ("A", POS::Adverb)         //
                ("AV", POS::Adverb)        //
                ("ADV", POS::Adverb)       //
                ("ADJ", POS::Adjective)    //
                ("PRP", POS::Pronoun)      // "ACCOUNT" probably mistagged in dataset
                    .sym |
                eoc >> x3::attr(POS::Other);
    
            auto attributes = x3::rule<Attributes, Attributes>{"attributes"} = x3::eps
    
                > count > ','          // SS
                > count > ','          // FR
                > strength > ','       // CON
                > homograph > ','      // H
                > part_of_speech > ',' // PS
                > strength > ','       // M
                > strength > ','       // PR
                > strength > ','       // RSG
                > count;               // UC
    
            auto association = x3::rule<Association, Association>{"association"} = (*header >> !x3::eoi)
    
                > col > ','      // CUE
                > col > ','      // TARGET
                > normed > ','   // NORMED?
                > count > ','    // #G
                > count > ','    // #P
                > strength > ',' // FSG
                > strength > ',' // BSG
                > strength > ',' // MSG
                > strength > ',' // OSG
                > count > ','    // #M
                > count > ','    // MMIAS
                > strength > ',' // #O
                > count;         // OMIAS
    
            auto row = x3::rule<Row, Row>{"row"} = //
                (*header >> !x3::eoi)              //
                > association > ','                // assoc
                > attributes > ','                 // qattr
                > attributes;                      // tattr
    
            auto file = x3::rule<File, File>{"file"} = *(row >> x3::eol);
    
            phrase_parse(f, l, x3::eps > file > x3::eoi, blank, _rows);
    
            using DataSet::NUMCOLUMNS;
            trace("Parsed: ", total_bytes / 1024.0 / 1024, "MiB ", "into ", _rows.size(), " rows");
    
            trace(_atoms.size(), " atoms totaling ", _atoms.size_bytes(), " bytes of backing storage");
        } catch (boost::spirit::x3::expectation_failure<It> const& ef) {
            It   b = _mfs.begin(), e = _mfs.end();
            auto w = ef.where();
            auto sol = b + std::string_view(b, w).find_last_of("\r\n") + 1;
    
            auto line = std::string_view(sol, e);
            line      = line.substr(0, line.find_first_of("\r\n"));
    
            auto l = 1 + std::count(b, w, '\n'), c = 1 + (w - sol);
            trace("Expected ", ef.which(), " at L:", l, ":", c, " at\n", //
                  "    ", line, "\n",                                    //
                  std::setw(c + 4), '^', "----- here");                  //
            std::exit(1);
        }
    };
    
    int main(int argc, char** argv) {
        AtomPool pool;
    
        trace("Start");
        {
            for (std::string fname : std::vector(argv+1, argv+argc))
                Reader r{fname, pool};
        }
        trace("Done");
    }
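
    A typical local build line might be something like this (a guess at the exact invocation; the code uses C++20 features and mapped_file_source needs the compiled Boost.Iostreams library):

    g++ -std=c++20 -O2 main.cpp -lboost_iostreams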
    

    Which prints, for the small demo dataset:

    ./a.out /Archive2/bd/de4002f5d4d3ee/main.cpp
         0ms Start
        54ms Parsed: 0.00769901MiB into 40 rows
         0ms 70 atoms totaling 462 bytes of backing storage
         0ms Done
         0ms ~AtomPool deduplicated 12.5% of 80 insertions
    

    Locally, the entire dataset is parsed in ~90ms:

    /build/sotest dataset/Cue_Target_Pairs.* dataset/full.txt 
    
         0ms Start
        14ms Parsed: 1.15371MiB into 8476 rows
         0ms 3732 atoms totaling 26643 bytes of backing storage
        13ms Parsed: 1.03647MiB into 7590 rows
         0ms 5340 atoms totaling 39238 bytes of backing storage
        19ms Parsed: 1.42218MiB into 10500 rows
         0ms 6901 atoms totaling 51720 bytes of backing storage
        15ms Parsed: 1.14505MiB into 8440 rows
         0ms 7917 atoms totaling 59821 bytes of backing storage
        14ms Parsed: 1.26198MiB into 9287 rows
         0ms 8709 atoms totaling 66346 bytes of backing storage
        15ms Parsed: 1.35513MiB into 9888 rows
         0ms 9456 atoms totaling 72769 bytes of backing storage
        14ms Parsed: 1.25216MiB into 9214 rows
         0ms 10055 atoms totaling 77640 bytes of backing storage
        15ms Parsed: 1.19052MiB into 8781 rows
         0ms 10617 atoms totaling 82271 bytes of backing storage
        86ms Parsed: 9.8172MiB into 72176 rows
         0ms 10617 atoms totaling 82271 bytes of backing storage
         0ms Done
         0ms ~AtomPool deduplicated 96.3225% of 288704 insertions
    

    Creating The Graph

    Let's create all (normed and unnormed) words as vertices and add all associations (input rows) as edges.

    #include <boost/graph/adjacency_list.hpp>
    using VertexProp = DataSet::WordAttributes;
    using EdgeProp   = DataSet::Association;
    using Graph      = boost::adjacency_list<boost::vecS, boost::vecS, boost::directedS, VertexProp, EdgeProp>;
    

    Implementing buildGraph can be pretty straightforward using a temporary mapping from atom to vertex descriptor:

    Graph buildGraph() const {
        using V = Graph::vertex_descriptor;
    
        Graph g;
        g.m_vertices.reserve(_atoms.size());
    
        std::map<Atom, V> vmap; // not flat-map for stability
    
        for (auto& [assoc, qattr, tattr] : _rows) {
            auto s = vmap.find(assoc.cue);
            auto t = vmap.find(assoc.target);
            if (s == vmap.end()) s = vmap.emplace(assoc.cue, add_vertex(qattr, g)).first;
            if (t == vmap.end()) t = vmap.emplace(assoc.target, add_vertex(tattr, g)).first;
    
            assert(g[s->second] == qattr); // check consistent input for qattr/tattr
            assert(g[t->second] == tattr);
    
            add_edge(s->second, t->second, assoc, g);
        }
    
        return g;
    }
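
    Once built, the graph can be queried through the bundled properties. For instance, a sketch (assuming the Graph, AtomPool and DataSet definitions above, plus <cmath> and <iostream>) that prints the strongest forward association:

    void printStrongest(Graph const& g, AtomPool const& pool) {
        double                      best = -1;
        DataSet::Association const* hit  = nullptr;
        for (auto [it, end] = edges(g); it != end; ++it) {
            auto const& a = g[*it]; // bundled EdgeProp, i.e. DataSet::Association
            if (!std::isnan(a.fsg) && a.fsg > best) {
                best = a.fsg;
                hit  = &a;
            }
        }
        if (hit)
            std::cout << pool.lookup(hit->cue) << " -> " << pool.lookup(hit->target)
                      << " (FSG " << hit->fsg << ")\n";
    }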
    

    In our main let's build the graphs after parsing:

    int main(int argc, char** argv) {
        AtomPool pool;
    
        trace("Start");
        for (std::string fname : std::vector(argv + 1, argv + argc)) {
            Reader r{fname, pool};
            trace("Parsed ", fname);
            {
                Graph g = r.buildGraph();
                trace("Graph creation done; ", num_vertices(g), " vertices and ", num_edges(g), " edges");
            }
            trace("Graph destruction done");
        }
        trace("Done");
    }
    

    Live On Coliru

    ./a.out /Archive2/bd/de4002f5d4d3ee/main.cpp
         0ms Start
       322ms Parsed: 0.00769901MiB into 40 rows
         0ms Parsed /Archive2/bd/de4002f5d4d3ee/main.cpp
         0ms Graph creation done; 70 vertices and 40 edges
         0ms Graph destruction done
         0ms Done
         0ms ~AtomPool deduplicated 12.5% of 80 insertions
    

    For the full dataset, the output becomes:

         0ms Start
        93ms Parsed: 9.8172MiB into 72176 rows
         0ms Parsed dataset/full.txt
        16ms Graph creation done; 10617 vertices and 72176 edges
         1ms Graph destruction done
         0ms Done
         0ms ~AtomPool deduplicated 92.6451% of 144352 insertions
    

    NOTE

    There are four more edges than you reported in your question. Perhaps you haven't added the undocumented part-of-speech codes like I did?


    As you can see, virtually no time is spent in destruction. And that is while copying all the data out of the Reader, so that the Reader can be disposed of after the graph has been created and the vertex/edge properties remain mutable.

    If you don't need that, you can make everything more memory-efficient using pointers instead.
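
    A sketch of that pointer-based variant (assuming the DataSet types above; the Reader must then outlive the graph, and buildGraph would store the addresses of its rows instead of copies):

    using VertexPropPtr = DataSet::WordAttributes const*;
    using EdgePropPtr   = DataSet::Association const*;
    using LightGraph    = boost::adjacency_list<boost::vecS, boost::vecS, boost::directedS,
                                                VertexPropPtr, EdgePropPtr>;
    // in a pointer-based buildGraph: add_vertex(&qattr, g) and add_edge(s, t, &assoc, g)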

    listS EdgeContainerSelector?

    I don't see a significant difference, though graph destruction now does indeed take 1ms longer:

         0ms Start
        90ms Parsed: 9.8172MiB into 72176 rows
         0ms Parsed dataset/full.txt
        13ms Graph creation done; 10617 vertices and 72176 edges
         2ms Graph destruction done
         0ms Done
         0ms ~AtomPool deduplicated 92.6451% of 144352 insertions
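
    Presumably the only change for this variant is the first template argument, which selects the out-edge container (a sketch mirroring the question's own choice of listS):

    using Graph = boost::adjacency_list<boost::listS, boost::vecS, boost::directedS, VertexProp, EdgeProp>;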
    

    bidirectionalS?

    This doesn't seem to make a ton of sense in the application domain¹. It also has significant storage and insertion overhead. However, under some access patterns that overhead might pay off in improved query performance. YMMV.
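
    A sketch of this bidirectional variant and of the incoming-edge access that would justify its overhead (assuming the VertexProp/EdgeProp and AtomPool definitions above; printCuesFor is a hypothetical helper):

    using BidiGraph = boost::adjacency_list<boost::vecS, boost::vecS, boost::bidirectionalS,
                                            VertexProp, EdgeProp>;

    // bidirectionalS gives constant-time access to incoming edges, e.g. every cue
    // whose association points at a given target vertex v:
    void printCuesFor(BidiGraph::vertex_descriptor v, BidiGraph const& g, AtomPool const& pool) {
        for (auto [it, end] = in_edges(v, g); it != end; ++it)
            std::cout << pool.lookup(g[*it].cue) << "\n";
    }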

    Compare Live or locally against the full data set:

         0ms Start
        90ms Parsed: 9.8172MiB into 72176 rows
         0ms Parsed dataset/full.txt
        17ms Graph creation done; 10617 vertices and 72176 edges
         6ms Graph destruction done
         0ms Done
         0ms ~AtomPool deduplicated 92.6451% of 144352 insertions
    

    As predicted, the added inefficiency does show up in the timings. Relatively it's quite significant, although 6x slower destruction is still only 6ms.

    ¹ The correlation between forward and backward strength for cues whose targets have been normed is positive but not high, r = .29 (n = 63,619), and the chances of correctly guessing back strengths from knowledge of forward strengths are low.