Tags: c++, linux, memory, linux-kernel, mmap

munmap() when processes share file descriptor table, but not virtual memory


I have unnamed interprocess shared memory regions that are created via mmap. Processes are created via the clone system call. The processes share the file descriptor table (CLONE_FILES) and file system information (CLONE_FS), but not the memory space (except for shared regions mapped before the clone call):

mmap(NULL, sz, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_ANONYMOUS, -1, 0);
syscall(SYS_clone, CLONE_FS | CLONE_FILES | SIGCHLD, nullptr);

My question is: what exactly happens if, after the clone, one (or both) of the processes calls munmap()?

My understanding is that munmap() will do two things:

  • Unmap the memory region (which in my case is not propagated between processes)
  • If it's an anonymous mapping, close the file descriptor (which in my case is propagated between processes)

I assume MAP_ANONYMOUS creates some sort of virtual file handled by the kernel (probably located in /proc?) that is automatically closed on munmap().

Therefore... the other process will map into memory a file that is not open and may even not exist anymore?

This is all very confusing to me, as I don't find any plausible explanation.
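
One way to test the assumption about a hidden file descriptor is to count the entries in /proc/self/fd around the mmap()/munmap() calls. The sketch below is not part of the original tests: count_open_fds and anonymous_mapping_keeps_fd_count are hypothetical helper names, and it assumes a Linux system with /proc mounted. If the count does not change, the anonymous mapping did not occupy a slot in the (shared) descriptor table, so there is nothing for munmap() to close.

```cpp
#include <cstddef>
#include <dirent.h>
#include <sys/mman.h>

// Counts the entries in /proc/self/fd, i.e. the descriptors open in this
// process. The descriptor used by opendir() itself is included, but it is
// included on every call, so differences between calls stay meaningful.
// Returns -1 if /proc/self/fd cannot be read.
int count_open_fds() {
    DIR *dir = opendir("/proc/self/fd");
    if (dir == nullptr) return -1;
    int count = 0;
    while (struct dirent *entry = readdir(dir)) {
        if (entry->d_name[0] != '.') ++count;  // skip "." and ".."
    }
    closedir(dir);
    return count;
}

// Returns true if creating and removing a MAP_SHARED | MAP_ANONYMOUS
// mapping leaves the number of open descriptors unchanged.
// Returns false if the check could not be carried out.
bool anonymous_mapping_keeps_fd_count(size_t size) {
    int before = count_open_fds();
    if (before < 0) return false;

    void *region = mmap(nullptr, size, PROT_READ | PROT_WRITE,
                        MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (region == MAP_FAILED) return false;
    int while_mapped = count_open_fds();

    munmap(region, size);
    int after = count_open_fds();

    return before == while_mapped && while_mapped == after;
}
```

Calling anonymous_mapping_keeps_fd_count(4096) from main() and printing the result is enough to run the check.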

Simple Test

In this test, the two processes are able to issue one munmap() each without any problems.

#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>
#include <stddef.h>
#include <signal.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sched.h>
int main() {
    int *value = (int*) mmap(NULL, sizeof(int), PROT_READ | PROT_WRITE,
                             MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    *value = 0;
    if (syscall(SYS_clone, CLONE_FS | CLONE_FILES | SIGCHLD, nullptr)) {
        sleep(1);
        printf("[parent] the value is %d\n", *value); // reads value just fine
        munmap(value, sizeof(int));
        // is the memory completely free'd now? if yes, why?
    } else {
        *value = 1234;
        printf("[child] set to %d\n", *value);
        munmap(value, sizeof(int));
        // printf("[child] value after unmap is %d\n", *value); // SIGSEGV
        printf("[child] exiting\n");
    }
}

Consecutive Allocations

In this test we map many anonymous regions sequentially.

In my system vm.max_map_count is 65530.

  • If both processes issue munmap(), all goes well and there seems to be no memory leak, though there is a significant delay before the memory is seen to be released. The program is also quite slow, as mmap()/munmap() do heavy work. Runtime is about 12 seconds.
  • If only the child issues munmap(), the program core dumps after hitting the 65530 mmaps, meaning the parent's mappings are never removed. The program runs slower and slower (initial 1000 mmaps take less than 1ms; last 1000 mmaps take 34 seconds).
  • If only the parent issues munmap(), the program executes normally, with a runtime of about 12 seconds as well. The child automatically unmaps the memory after exiting.
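
The limit and the current number of mappings mentioned above can be inspected from inside the process. The following is a minimal sketch, assuming a Linux system where /proc/sys/vm/max_map_count and /proc/self/maps are readable; read_max_map_count and count_current_mappings are hypothetical helper names. Each line of /proc/self/maps describes one mapping, so the line count approximates how close the process is to the limit.

```cpp
#include <fstream>
#include <string>

// Reads the per-process mapping limit (vm.max_map_count).
// Returns -1 if the file is missing or cannot be parsed.
long read_max_map_count() {
    std::ifstream in("/proc/sys/vm/max_map_count");
    long limit = -1;
    if (!(in >> limit)) return -1;
    return limit;
}

// Counts the lines of /proc/self/maps; each line is one mapping of the
// calling process. Returns -1 if the file cannot be opened.
long count_current_mappings() {
    std::ifstream in("/proc/self/maps");
    if (!in) return -1;
    long count = 0;
    std::string line;
    while (std::getline(in, line)) ++count;
    return count;
}
```

Printing count_current_mappings() every few thousand iterations of the loop in the parent would show whether the count keeps growing when the parent skips munmap().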

The code I used:

#include <cassert>
#include <thread>
#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>
#include <stddef.h>
#include <signal.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <sched.h>

#define NUM_ITERATIONS 100000
#define ALLOC_SIZE (4ul<<0)  // parenthesized so the macro expands safely inside expressions

int main() {
    printf("iterations = %d\n", NUM_ITERATIONS);
    printf("alloc size = %lu\n", ALLOC_SIZE);
    assert(ALLOC_SIZE >= sizeof(int));
    assert(ALLOC_SIZE >= sizeof(bool));
    bool *written = (bool*) mmap(NULL, ALLOC_SIZE, PROT_READ | PROT_WRITE,
                               MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    for(int i=0; i < NUM_ITERATIONS; i++) {
        if(i % (NUM_ITERATIONS / 100) == 0) {
            printf("%d%%\n", i / (NUM_ITERATIONS / 100));
        }
        int *value = (int*) mmap(NULL, ALLOC_SIZE, PROT_READ | PROT_WRITE,
                                 MAP_SHARED | MAP_ANONYMOUS, -1, 0);
        *value = 0;
        *written = 0;
        if (int rv = syscall(SYS_clone, CLONE_FS | CLONE_FILES | SIGCHLD, nullptr)) {
            while(*written == 0) std::this_thread::yield();
            assert(*value == i);
            munmap(value, ALLOC_SIZE);
            waitpid(-1, NULL, 0);
        } else {
            *value = i;
            *written = 1;
            munmap(value, ALLOC_SIZE);
            return 0;
        }
    }
    return 0;
}

It seems the kernel keeps a reference count for the anonymous mapping, and each munmap() decrements it. Once the count reaches zero, the memory is eventually reclaimed by the kernel.
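
The observation above can be checked directly: if the parent unmaps first, the child should still be able to read the region. The sketch below is an illustration under stated assumptions, not the original test. It uses plain fork() instead of the raw clone call to keep it short, two pipes for ordering, and a hypothetical helper name, child_reads_after_parent_unmaps.

```cpp
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// The parent writes `value` into a shared anonymous region, unmaps it, and
// only then lets the child read the region. Returns what the child read,
// or -1 on any error.
int child_reads_after_parent_unmaps(int value) {
    int *shared = static_cast<int *>(
        mmap(nullptr, sizeof(int), PROT_READ | PROT_WRITE,
             MAP_SHARED | MAP_ANONYMOUS, -1, 0));
    if (shared == MAP_FAILED) return -1;

    int go[2];      // parent -> child: "I have unmapped my view"
    int result[2];  // child -> parent: the value the child read
    if (pipe(go) != 0) return -1;
    if (pipe(result) != 0) return -1;

    pid_t pid = fork();
    if (pid < 0) return -1;

    if (pid == 0) {
        // Child: wait for the parent's signal, then read the region.
        close(go[1]);
        close(result[0]);
        char token;
        if (read(go[0], &token, 1) != 1) _exit(1);
        int seen = *shared;  // the child's own mapping is still in place
        if (write(result[1], &seen, sizeof seen) != sizeof seen) _exit(1);
        munmap(shared, sizeof(int));  // drops the last remaining mapping
        _exit(0);
    }

    // Parent: write, unmap, and only afterwards release the child.
    close(go[0]);
    close(result[1]);
    *shared = value;
    munmap(shared, sizeof(int));

    char token = 'x';
    int seen = -1;
    bool ok = write(go[1], &token, 1) == 1 &&
              read(result[0], &seen, sizeof seen) == sizeof seen;
    close(go[1]);
    close(result[0]);
    waitpid(pid, nullptr, 0);
    return ok ? seen : -1;
}
```

If the child gets back the value the parent wrote, the parent's munmap() removed only the parent's view of the region, which is consistent with the reference-counting explanation.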

The program runtime is nearly independent of the allocation size. An ALLOC_SIZE of 4 bytes takes just under 12 seconds, while an allocation of 1 MB takes a little over 13 seconds.

A variable allocation size of 1ul<<30 - 4096 * i or 1ul<<30 + 4096 * i results in execution times of 12.9 and 13.0 seconds respectively (within the margin of error).

A few conclusions are:

  • mmap() takes (approximately?) the same time regardless of the allocation size
  • mmap() takes longer the more mappings already exist. First 1000 mmaps take around 0.05 seconds; 1000 mmaps after having 64000 mmaps take 34 seconds.
  • munmap() needs to be issued in ALL processes mapping the same region for it to be reclaimed by the kernel.

Solution

  • Using the program below I was able to draw some empirical conclusions (though I have no guarantee they are correct):

    • mmap() takes approximately the same time regardless of the allocation size. This is due to efficient memory management by the Linux kernel: mapped memory doesn't take up space until it is written to.
    • mmap() takes longer the more mappings already exist. The first 1000 mmaps take around 0.05 seconds; 1000 mmaps after having 64000 mappings take around 34 seconds. I haven't checked the Linux kernel source, but inserting a mapped region into the index probably takes O(n) rather than the O(1) that some data structures would allow. Note that each iteration also includes the clone call, which has to duplicate the parent's list of mappings, so part of this growth may come from clone rather than from mmap itself. A kernel patch is possible, but it's probably not a problem for anyone but me :-)
    • munmap() needs to be issued in ALL processes mapping the same MAP_ANONYMOUS region for it to be reclaimed by the kernel. Doing so correctly frees the shared memory region.
    #include <cassert>
    #include <cinttypes>
    #include <thread>
    #include <stdio.h>
    #include <unistd.h>
    #include <stdlib.h>
    #include <stddef.h>
    #include <signal.h>
    #include <sys/mman.h>
    #include <sys/syscall.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <sched.h>
    #include <time.h>  // clock_gettime, struct timespec
    
    #define NUM_ITERATIONS 100000
    #define ALLOC_SIZE (1ul<<30)
    #define CLOCK_TYPE CLOCK_PROCESS_CPUTIME_ID
    #define NUM_ELEMS (1024*1024/4)
    
    struct timespec start_time;
    
    int main() {
        clock_gettime(CLOCK_TYPE, &start_time);
        printf("iterations = %d\n", NUM_ITERATIONS);
        printf("alloc size = %lu\n", ALLOC_SIZE);
        assert(ALLOC_SIZE >= NUM_ELEMS * sizeof(int));
        bool *written = (bool*) mmap(NULL, sizeof(bool), PROT_READ | PROT_WRITE,
                                   MAP_SHARED | MAP_ANONYMOUS, -1, 0);
        for(int i=0; i < NUM_ITERATIONS; i++) {
            if(i % (NUM_ITERATIONS / 100) == 0) {
                struct timespec now;
                struct timespec elapsed;
                printf("[%3d%%]", i / (NUM_ITERATIONS / 100));
                clock_gettime(CLOCK_TYPE, &now);
                if (now.tv_nsec < start_time.tv_nsec) {
                    elapsed.tv_sec = now.tv_sec - start_time.tv_sec - 1;
                    elapsed.tv_nsec = now.tv_nsec - start_time.tv_nsec + 1000000000;
                } else {
                    elapsed.tv_sec = now.tv_sec - start_time.tv_sec;
                    elapsed.tv_nsec = now.tv_nsec - start_time.tv_nsec;
                }
                // tv_sec is a time_t, so cast it to match the PRIdMAX format
                printf("%05" PRIdMAX ".%09ld\n", (intmax_t) elapsed.tv_sec, elapsed.tv_nsec);
            }
            int *value = (int*) mmap(NULL, ALLOC_SIZE, PROT_READ | PROT_WRITE,
                                     MAP_SHARED | MAP_ANONYMOUS, -1, 0);
            *value = 0;
            *written = 0;
            if (int rv = syscall(SYS_clone, CLONE_FS | CLONE_FILES | SIGCHLD, nullptr)) {
                while(*written == 0) std::this_thread::yield();
                assert(*value == i);
                munmap(value, ALLOC_SIZE);
                waitpid(-1, NULL, 0);
            } else {
                for(int j=0; j<NUM_ELEMS; j++)
                    value[j] = i;
                *written = 1;
                //munmap(value, ALLOC_SIZE);
                return 0;
            }
        }
        return 0;
    }