In my program I create 5 vectors, each with 1 million elements. When I compile the program with -O3 optimization, its maximum resident set size is around 2 GB. However, if I compile with -O3 and link against the tcmalloc library provided by google-perftools, the maximum resident set size is only 1.5 GB. Can someone please explain why this happens? Is linking against tcmalloc always better than linking against glibc malloc?
`tcmalloc` is page-oriented, meaning that the internal unit of measure is usually pages rather than bytes. This has the effect of making it easier to reduce fragmentation and increase locality in various ways.
`tcmalloc` defines a page as 8192 bytes, which is actually 2 pages on most Linux systems.
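To make the page arithmetic concrete, here is a minimal sketch. The 8192-byte figure comes from the text above; the 4096-byte OS page is an assumption that holds on typical x86-64 Linux systems but not everywhere, and all names are made up for illustration rather than taken from tcmalloc's source.

```cpp
#include <cstddef>

// Illustrative constants; not read from tcmalloc or the running system.
constexpr std::size_t kTcmallocPageBytes = 8192;  // page size stated above
constexpr std::size_t kAssumedOsPageBytes = 4096; // assumption: typical x86-64 Linux

// How many OS pages make up one tcmalloc page under these assumptions.
constexpr std::size_t os_pages_per_tcmalloc_page() {
    return kTcmallocPageBytes / kAssumedOsPageBytes;
}

// Number of tcmalloc pages needed to hold `bytes`, rounding up.
constexpr std::size_t tcmalloc_pages_for(std::size_t bytes) {
    return (bytes + kTcmallocPageBytes - 1) / kTcmallocPageBytes;
}

static_assert(os_pages_per_tcmalloc_page() == 2, "one tcmalloc page spans two 4 KiB OS pages");
static_assert(tcmalloc_pages_for(8192) == 1, "exactly one page");
static_assert(tcmalloc_pages_for(8193) == 2, "one byte over rounds up to two pages");
```

On a real system you would query the OS page size at run time (for example with `sysconf(_SC_PAGESIZE)` on POSIX) instead of hard-coding 4096.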
Chunks can be thought of as divided into two top-level categories. "Small" chunks are smaller than kMaxPages pages (kMaxPages defaults to 128); they are further divided into size classes and satisfied by the thread caches or the central per-size-class caches. "Large" chunks are >= kMaxPages pages and are always satisfied by the central PageHeap.
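As a rough illustration of that split, the sketch below classifies a request by its size in pages. It follows the description above, not tcmalloc's actual code: the constants are the defaults quoted here and may differ between versions, and `ChunkKind`, `pages_for`, and `classify` are hypothetical names. The example size assumes each vector in the question holds 8-byte elements, which the question does not state.

```cpp
#include <cstddef>

// Values as quoted in the text; illustrative only.
constexpr std::size_t kPageBytes = 8192;
constexpr std::size_t kMaxPages = 128;

enum class ChunkKind { Small, Large };

// Number of 8192-byte pages needed to hold `bytes`, rounding up.
constexpr std::size_t pages_for(std::size_t bytes) {
    return (bytes + kPageBytes - 1) / kPageBytes;
}

// Small: fewer than kMaxPages pages (thread caches / central caches).
// Large: kMaxPages pages or more (central PageHeap).
constexpr ChunkKind classify(std::size_t bytes) {
    return pages_for(bytes) < kMaxPages ? ChunkKind::Small : ChunkKind::Large;
}

// One vector from the question, assuming 8-byte elements (an assumption).
constexpr std::size_t kVectorBytes = std::size_t{1000000} * 8;

static_assert(pages_for(kVectorBytes) == 977, "8,000,000 bytes round up to 977 pages");
static_assert(classify(kVectorBytes) == ChunkKind::Large, "well above the 128-page limit");
static_assert(classify(4096) == ChunkKind::Small, "a 4 KiB request fits in one page");
```

Under these assumptions each of the five vector buffers would be a "large" chunk served by the PageHeap, so the size-class machinery would not come into play for them.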
More here: http://jamesgolick.com/2013/5/19/how-tcmalloc-works.html