When filling a private vector with push_back inside a parallel for loop, a significant amount of memory is left behind after the loop has finished and all references to the vector have been removed.
std::vector<float> points;
for (int k = 0; k < 2e7; k++)
{
    points.push_back(k);
}
Filling it in the following manner, by contrast, leaves no significant memory behind after the loop finishes and the scope is exited.
std::vector<float> points;
points.resize(2e7);
for (int k = 0; k < 2e7; k++)
{
    points[k] = k;
}
Any ideas as to the cause of this memory usage and how to clear it?
My actual use case is more complex and uses the PCL library; it seems that no matter which way I generate a point cloud inside a parallel loop, an excessive amount of memory is left behind.
Below is the simplest example I have managed to produce that demonstrates the issue. Outer iterations 0-5 use the pre-allocated code, iterations 6-19 use the problematic push_back code, and later iterations switch back to the pre-allocated version.
Output of the code (loop count, VIRT memory, RES memory):
------------------- omp_test START -------------------
0 endofloop 510 MB 3 MB
1 endofloop 510 MB 3 MB
2 endofloop 510 MB 3 MB
3 endofloop 510 MB 3 MB
4 endofloop 510 MB 3 MB
5 endofloop 510 MB 3 MB
6 endofloop 510 MB 227 MB
7 endofloop 510 MB 227 MB
8 endofloop 510 MB 227 MB
9 endofloop 510 MB 227 MB
10 endofloop 510 MB 227 MB
11 endofloop 510 MB 227 MB
12 endofloop 510 MB 227 MB
13 endofloop 510 MB 227 MB
14 endofloop 510 MB 227 MB
15 endofloop 510 MB 227 MB
16 endofloop 510 MB 227 MB
17 endofloop 510 MB 227 MB
18 endofloop 510 MB 227 MB
19 endofloop 510 MB 227 MB
20 endofloop 510 MB 227 MB
21 endofloop 510 MB 227 MB
22 endofloop 510 MB 227 MB
23 endofloop 510 MB 227 MB
24 endofloop 510 MB 227 MB
25 endofloop 510 MB 227 MB
26 endofloop 510 MB 227 MB
27 endofloop 510 MB 227 MB
28 endofloop 510 MB 227 MB
29 endofloop 510 MB 227 MB
30 endofloop 510 MB 227 MB
31 endofloop 510 MB 227 MB
32 endofloop 510 MB 227 MB
33 endofloop 510 MB 227 MB
34 endofloop 510 MB 227 MB
35 endofloop 510 MB 227 MB
36 endofloop 510 MB 227 MB
37 endofloop 510 MB 227 MB
38 endofloop 510 MB 227 MB
39 endofloop 510 MB 227 MB
40 endofloop 510 MB 227 MB
41 endofloop 510 MB 227 MB
42 endofloop 510 MB 227 MB
I have run this with g++ 9 and 10 on Ubuntu.
g++ -fopenmp -DNDEBUG -O2 nestedtest.cpp && ./a.out
nestedtest.cpp
#include <iostream>
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <ctime>
#include <memory>
#include <vector>
#include <omp.h>
#include <unistd.h>

int parseLine(char *line) // helper for reading memory usage
{
    // This assumes that a digit will be found and the line ends in " Kb".
    int i = strlen(line);
    const char *p = line;
    while (*p < '0' || *p > '9')
        p++;
    line[i - 3] = '\0';
    i = atoi(p);
    return i;
}

int getValue() // reads the virtual memory size (VmSize) of this process
{              // Note: this value is in KB!
    FILE *file = fopen("/proc/self/status", "r");
    int result = -1;
    char line[128];
    while (fgets(line, 128, file) != NULL)
    {
        if (strncmp(line, "VmSize:", 7) == 0)
        {
            result = parseLine(line);
            break;
        }
    }
    fclose(file);
    return result;
}

int getValueP() // reads the resident set size (VmRSS) of this process
{               // Note: this value is in KB!
    FILE *file = fopen("/proc/self/status", "r");
    int result = -1;
    char line[128];
    while (fgets(line, 128, file) != NULL)
    {
        if (strncmp(line, "VmRSS:", 6) == 0)
        {
            result = parseLine(line);
            break;
        }
    }
    fclose(file);
    return result;
}

int main(int argc, char *argv[])
{
    std::cout << "------------------- omp_test START -------------------" << std::endl;
    for (int i = 0; i < 100; i++)
    {
#pragma omp parallel for schedule(dynamic, 1) num_threads(8)
        // #pragma omp parallel for schedule(dynamic) firstprivate(points)
        for (int j = 0; j < 20; j++)
        {
            if (i > 5 && i < 20)
            {
                // problematic version: grow the vector with push_back
                std::vector<float> points;
                for (int k = 0; k < 2e7; k++)
                {
                    points.push_back(k);
                }
            }
            else
            {
                // pre-allocated version: resize once, then assign
                std::vector<float> points;
                points.resize(2e7);
                for (int k = 0; k < 2e7; k++)
                {
                    points[k] = k;
                }
            }
        }
#pragma omp critical
        {
            usleep(10000);
            std::cout << i << "\tendofloop " << getValue() / 1024 << " MB\t" << (getValueP() / 1024) << " MB" << std::endl;
        }
    }
    exit(0);
}
// end: main()
What you observe is simply how memory allocation works.
glibc makes use of two different kinds of allocation, described in detail here. In a nutshell, small allocations are served as pieces cut out of larger memory allocations called arenas. When you free such a piece, it is returned to its arena and the RSS of the process doesn't decrease. Larger allocations are taken directly from the OS using anonymous memory mappings. Freeing a large allocation unmaps the memory region, which restores the RSS to its earlier value, provided that no new pages are mapped into other memory regions.
The reason for having two different algorithms is that asking the kernel to perform memory mapping is a slow operation as it involves modification to the process page tables. Cutting out pieces of a larger block happens entirely in user space, but it is suboptimal for large allocations.
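For illustration only, here is a minimal standalone sketch of the two paths (the 64 KiB and 8 MiB sizes are arbitrary; 128 KiB is glibc's default mmap threshold and assumes an untuned allocator):
#include <cstdlib>

int main()
{
    // Small request: below glibc's default 128 KiB mmap threshold,
    // so it is cut out of an arena.
    void *small_block = std::malloc(64 * 1024);

    // Large request: well above the threshold, served by its own
    // anonymous memory mapping.
    void *large_block = std::malloc(8 * 1024 * 1024);

    std::free(small_block); // the piece goes back to the arena for reuse
    std::free(large_block); // the mapping is munmap()ed and returned to the OS

    return 0;
}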
When you push_back() in a loop to an std::vector without resizing it first, the vector will continuously reallocate on demand, doubling its storage size. Initially, the allocations will be small enough to come from the arena, until the vector reaches a point where its storage starts being memory-mapped from the OS. When you do points.resize(2e7), you force the vector to allocate all the storage at once, which is directly served by an anonymous memory mapping operation.
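As an aside (not part of the question's code): if the final size is known but you still want to use push_back, reserving the capacity up front makes the vector perform a single large allocation that glibc serves with one anonymous memory mapping, so it should behave like the resize variant in the question. A minimal sketch:
std::vector<float> points;
points.reserve(2e7); // one ~80 MB allocation, well above the mmap threshold
for (int k = 0; k < 2e7; k++)
{
    points.push_back(k); // no reallocation: capacity is already sufficient
}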
glibc uses per-thread arenas and creates a new arena for each thread. The arena of the main thread can shrink back if you free blocks at its very end. The arenas of the other threads do not shrink that readily. This is due to the different mechanisms used to allocate memory for the arenas. The main one is (de-)allocated by moving the program break, also known as the end of the data segment. The other arenas employ anonymous memory mappings.
When OpenMP spawns a new thread, it starts with a small arena. When the vector starts asking for increasingly larger memory blocks, the arena grows in order to accommodate the demand. At one point, the vector grows so large it starts receiving anonymous memory mappings instead. When those mappings aren't too big, instead of unmapping them upon destruction of the vector they get added to the arena as cache for future memory allocations (unmapping is also quite expensive). Not only this, but also the threshold for the switch to anonymous memory mappings gets updated upwards. This is why each thread ends up having 32 MiB of cached memory in each arena, which multiplied by 7 threads gives 224 MiB. Add a couple of MiB from the main arena and you get 227 MiB.
If having 32 MiB of memory cached in each thread makes you uneasy, simply tell glibc not to dynamically increase the memory mapping threshold by setting it to a fixed value via the MALLOC_MMAP_THRESHOLD_ environment variable:
$ ./a.out
...
5 endofloop 524 MB 3 MB
6 endofloop 524 MB 227 MB
...
$ MALLOC_MMAP_THRESHOLD_=4194304 ./a.out
...
5 endofloop 524 MB 3 MB
6 endofloop 524 MB 4 MB
...
Doing so may negatively impact performance, though.
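The same threshold can also be pinned programmatically via glibc's non-standard mallopt() call. A minimal sketch, assuming it is called before the parallel loops start allocating:
#include <malloc.h>

int main()
{
    // Equivalent to MALLOC_MMAP_THRESHOLD_=4194304: fix the threshold at 4 MiB,
    // which also disables glibc's dynamic adjustment of it.
    mallopt(M_MMAP_THRESHOLD, 4 * 1024 * 1024);

    // ... the rest of the program, e.g. the OpenMP test loops ...
    return 0;
}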
glibc provides the non-standard malloc_stats() call (include malloc.h), which prints general information about the arenas.
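A minimal way to use it in the test program above (hypothetical placement, purely for illustration) is to call it next to the existing /proc reporting, so the per-arena figures can be compared against VmRSS:
#include <malloc.h> // glibc-specific

// ... inside the outer loop, next to the existing reporting:
std::cout << i << "\tendofloop " << getValue() / 1024 << " MB\t"
          << (getValueP() / 1024) << " MB" << std::endl;
malloc_stats(); // prints "system bytes" / "in use bytes" per arena to stderr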