When filling a private vector with push_back inside a parallel for loop, a significant amount of memory is left behind after the loop has finished and all references to the vector have been removed.
std::vector<float> points;
for (int k = 0; k < 2e7; k++)
{
    points.push_back(k);
}
Filling it in the following manner, by contrast, leaves no significant memory behind after the loop finishes and the scope is exited.
std::vector<float> points;
points.resize(2e7);
for (int k = 0; k < 2e7; k++)
{
    points[k] = k;
}
Any ideas as to the cause of this memory usage and how to clear it?
My actual use case is more complex and uses the PCL library; it seems that no matter which way I generate a point cloud inside a parallel loop, an excessive amount of memory is left behind.
Below is the simplest example I have managed to produce that demonstrates the issue. Outer iterations 0-5 use the pre-allocated code, iterations 6-19 use the problematic push_back code, and later iterations switch back to the pre-allocated version.
Output of the code (loop count, VIRT memory, RES memory):
------------------- omp_test START -------------------
0 endofloop 510 MB 3 MB
1 endofloop 510 MB 3 MB
2 endofloop 510 MB 3 MB
3 endofloop 510 MB 3 MB
4 endofloop 510 MB 3 MB
5 endofloop 510 MB 3 MB
6 endofloop 510 MB 227 MB
7 endofloop 510 MB 227 MB
8 endofloop 510 MB 227 MB
9 endofloop 510 MB 227 MB
10 endofloop 510 MB 227 MB
11 endofloop 510 MB 227 MB
12 endofloop 510 MB 227 MB
13 endofloop 510 MB 227 MB
14 endofloop 510 MB 227 MB
15 endofloop 510 MB 227 MB
16 endofloop 510 MB 227 MB
17 endofloop 510 MB 227 MB
18 endofloop 510 MB 227 MB
19 endofloop 510 MB 227 MB
20 endofloop 510 MB 227 MB
21 endofloop 510 MB 227 MB
22 endofloop 510 MB 227 MB
23 endofloop 510 MB 227 MB
24 endofloop 510 MB 227 MB
25 endofloop 510 MB 227 MB
26 endofloop 510 MB 227 MB
27 endofloop 510 MB 227 MB
28 endofloop 510 MB 227 MB
29 endofloop 510 MB 227 MB
30 endofloop 510 MB 227 MB
31 endofloop 510 MB 227 MB
32 endofloop 510 MB 227 MB
33 endofloop 510 MB 227 MB
34 endofloop 510 MB 227 MB
35 endofloop 510 MB 227 MB
36 endofloop 510 MB 227 MB
37 endofloop 510 MB 227 MB
38 endofloop 510 MB 227 MB
39 endofloop 510 MB 227 MB
40 endofloop 510 MB 227 MB
41 endofloop 510 MB 227 MB
42 endofloop 510 MB 227 MB
I have run this with g++ 9 and 10 on Ubuntu.
g++ -fopenmp -DNDEBUG -O2 nestedtest.cpp && ./a.out
nestedtest.cpp
#include <iostream>
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <ctime>
#include <memory>
#include <vector>
#include <omp.h>
#include <unistd.h>

int parseLine(char *line) // helper for reading memory usage
{
    // This assumes that a digit will be found and the line ends in " Kb".
    int i = strlen(line);
    const char *p = line;
    while (*p < '0' || *p > '9')
        p++;
    line[i - 3] = '\0';
    i = atoi(p);
    return i;
}

int getValue() // reads the virtual memory size (VmSize) of this process
{              // Note: this value is in KB!
    FILE *file = fopen("/proc/self/status", "r");
    int result = -1;
    char line[128];
    while (fgets(line, 128, file) != NULL)
    {
        if (strncmp(line, "VmSize:", 7) == 0)
        {
            result = parseLine(line);
            break;
        }
    }
    fclose(file);
    return result;
}

int getValueP() // reads the resident set size (VmRSS) of this process
{               // Note: this value is in KB!
    FILE *file = fopen("/proc/self/status", "r");
    int result = -1;
    char line[128];
    while (fgets(line, 128, file) != NULL)
    {
        if (strncmp(line, "VmRSS:", 6) == 0)
        {
            result = parseLine(line);
            break;
        }
    }
    fclose(file);
    return result;
}

int main(int argc, char *argv[])
{
    std::cout << "------------------- omp_test START -------------------" << std::endl;
    for (int i = 0; i < 100; i++)
    {
#pragma omp parallel for schedule(dynamic, 1) num_threads(8)
        // #pragma omp parallel for schedule(dynamic) firstprivate(points)
        for (int j = 0; j < 20; j++)
        {
            if (i > 5 && i < 20)
            {
                // problematic version: grow the vector with push_back
                std::vector<float> points;
                for (int k = 0; k < 2e7; k++)
                {
                    points.push_back(k);
                }
            }
            else
            {
                // pre-allocated version: resize once, then assign
                std::vector<float> points;
                points.resize(2e7);
                for (int k = 0; k < 2e7; k++)
                {
                    points[k] = k;
                }
            }
        }
#pragma omp critical
        {
            usleep(10000);
            std::cout << i << "\tendofloop " << getValue() / 1024 << " MB\t" << (getValueP() / 1024) << " MB" << std::endl;
        }
    }
    exit(0);
}
// end: main()
What you observe is simply how memory allocation works.
glibc makes use of two different kinds of allocation, described in detail here. In a nutshell, small allocations are served as pieces cut out of larger memory allocations called arenas. When you free such a piece, it is returned to its arena and the RSS of the process doesn't decrease. Larger allocations are taken directly from the OS using anonymous memory mappings. Freeing a large allocation unmaps the memory region, which restores the RSS to its earlier value, provided that no new pages are mapped into other memory regions.
The reason for having two different algorithms is that asking the kernel to perform memory mapping is a slow operation as it involves modification to the process page tables. Cutting out pieces of a larger block happens entirely in user space, but it is suboptimal for large allocations.
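For illustration only, here is a minimal standalone sketch of the two paths (the 64 KiB and 8 MiB sizes are arbitrary; 128 KiB is glibc's default mmap threshold and assumes an untuned allocator):
#include <cstdlib>

int main()
{
    // Small request: below glibc's default 128 KiB mmap threshold,
    // so it is cut out of an arena.
    void *small_block = std::malloc(64 * 1024);

    // Large request: well above the threshold, served by its own
    // anonymous memory mapping.
    void *large_block = std::malloc(8 * 1024 * 1024);

    std::free(small_block); // the piece goes back to the arena for reuse
    std::free(large_block); // the mapping is munmap()ed and returned to the OS

    return 0;
}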
When you push_back() in a loop to an std::vector without resizing it first, the vector will continuously reallocate on demand, doubling its storage size. Initially, the allocations will be small enough to come from the arena, until the vector reaches a point where its storage starts being memory-mapped from the OS. When you do points.resize(2e7), you force the vector to allocate all the storage at once, which is directly served by an anonymous memory mapping operation.
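As an aside (not part of the question's code): if the final size is known but you still want to use push_back, reserving the capacity up front makes the vector perform a single large allocation that glibc serves with one anonymous memory mapping, so it should behave like the resize variant in the question. A minimal sketch:
std::vector<float> points;
points.reserve(2e7); // one ~80 MB allocation, well above the mmap threshold
for (int k = 0; k < 2e7; k++)
{
    points.push_back(k); // no reallocation: capacity is already sufficient
}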
glibc uses per-thread arenas and creates a new arena for each thread. The arena of the main thread can shrink back if you free blocks at its very end. The arenas of the other threads do not shrink that readily. This is due to the different mechanisms used to allocate memory for the arenas. The main one is (de-)allocated by moving the program break, also known as the end of the data segment. The other arenas employ anonymous memory mappings.
When OpenMP spawns a new thread, it starts with a small arena. When the vector starts asking for increasingly larger memory blocks, the arena grows in order to accommodate the demand. At one point, the vector grows so large it starts receiving anonymous memory mappings instead. When those mappings aren't too big, instead of unmapping them upon destruction of the vector they get added to the arena as cache for future memory allocations (unmapping is also quite expensive). Not only this, but also the threshold for the switch to anonymous memory mappings gets updated upwards. This is why each thread ends up having 32 MiB of cached memory in each arena, which multiplied by 7 threads gives 224 MiB. Add a couple of MiB from the main arena and you get 227 MiB.
If having 32 MiB of memory cached in each thread makes you uneasy, simply tell glibc not to dynamically increase the memory mapping threshold by setting it to a fixed value via the MALLOC_MMAP_THRESHOLD_ environment variable:
$ ./a.out
...
5 endofloop 524 MB 3 MB
6 endofloop 524 MB 227 MB
...
$ MALLOC_MMAP_THRESHOLD_=4194304 ./a.out
...
5 endofloop 524 MB 3 MB
6 endofloop 524 MB 4 MB
...
Doing so may negatively impact performance, though.
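The same threshold can also be pinned programmatically via glibc's non-standard mallopt() call. A minimal sketch, assuming it is called before the parallel loops start allocating:
#include <malloc.h>

int main()
{
    // Equivalent to MALLOC_MMAP_THRESHOLD_=4194304: fix the threshold at 4 MiB,
    // which also disables glibc's dynamic adjustment of it.
    mallopt(M_MMAP_THRESHOLD, 4 * 1024 * 1024);

    // ... the rest of the program, e.g. the OpenMP test loops ...
    return 0;
}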
glibc provides the non-standard malloc_stats() call (include malloc.h), which prints general information about the arenas.
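A minimal way to use it in the test program above (hypothetical placement, purely for illustration) is to call it next to the existing /proc reporting, so the per-arena figures can be compared against VmRSS:
#include <malloc.h> // glibc-specific

// ... inside the outer loop, next to the existing reporting:
std::cout << i << "\tendofloop " << getValue() / 1024 << " MB\t"
          << (getValueP() / 1024) << " MB" << std::endl;
malloc_stats(); // prints "system bytes" / "in use bytes" per arena to stderr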