
Go mount namespace: why are mounted volumes not cleared after the process exits?


With the code below, I expected that if I start a process with syscall.CLONE_NEWNS, every mount made inside the namespace would be cleared when the process exits.

But that is not what happens.

package main
import (
        "fmt"
        "os"
        "os/exec"
        "syscall"
)

var command string = "/usr/bin/bash"

func container_command() {

        fmt.Printf("starting container command %s\n", command)
        cmd := exec.Command(command)
        cmd.SysProcAttr = &syscall.SysProcAttr{Cloneflags: syscall.CLONE_NEWPID |
                syscall.CLONE_NEWNS,
        }
        cmd.Stdin = os.Stdin
        cmd.Stdout = os.Stdout
        cmd.Stderr = os.Stderr

        if err := cmd.Run(); err != nil {
                fmt.Println("error", err)
                os.Exit(1)
        }
}

func main() {
        fmt.Printf("starting current process %d\n", os.Getpid())
        container_command()
        fmt.Printf("command ended\n")

}

Run this and mount a directory: the mount still exists on the host after the program exits.

[root@localhost go]# go run namespace-1.go
starting current process 7558
starting container command /usr/bin/bash
[root@ns-process go]# mount --bind /home /mnt
[root@ns-process go]# ls /mnt
vagrant
[root@ns-process go]# exit
exit
command ended
[root@localhost go]# ls /mnt
vagrant
[root@localhost go]#

If this is the intended behavior, how does proc get mounted in container implementations? If I mount proc inside the namespace, I get:

[root@ns-process go]# mount -t proc /proc
[root@ns-process go]# exit
exit
command ended
[root@localhost go]# mount
mount: failed to read mtab: No such file or directory
[root@localhost go]#

proc has to be remounted on the host to get it back.

Update: doing the same in C gives the same result, so I think this is intended behavior.

#define _GNU_SOURCE
#include <sys/types.h>
#include <sys/wait.h>
#include <stdio.h>
#include <stdlib.h>   /* system() */
#include <sched.h>
#include <signal.h>
#include <unistd.h>
#define STACK_SIZE (1024 * 1024)
static char container_stack[STACK_SIZE];

char* const container_args[] = {
    "/bin/bash",
    NULL
};

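int container_main(void* arg)
{
    printf("Container [%5d] - inside the container!\n", getpid());
    /* The length excludes the terminating NUL. */
    sethostname("container", 9);
    /* Mount proc for the new PID namespace. */
    system("mount -t proc proc /proc");
    execv(container_args[0], container_args);
    /* Only reached if execv fails. */
    printf("Something's wrong!\n");
    return 1;
}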

int main()
{
    printf("start a container!\n");
    int container_pid = clone(container_main, container_stack+STACK_SIZE,
            CLONE_NEWUTS | CLONE_NEWPID | CLONE_NEWNS | SIGCHLD, NULL);
    waitpid(container_pid, NULL, 0);
    printf("container ended!\n");
    return 0;
}

command output:

[root@localhost ~]# gcc a.c
[root@localhost ~]# ./a.out
start a container!
Container [    1] - inside the container!
[root@container ~]# ps -ef
UID        PID  PPID  C STIME TTY          TIME CMD
root         1     0  0 08:57 pts/0    00:00:00 /bin/bash
root        17     1  0 08:57 pts/0    00:00:00 ps -ef
[root@container ~]# exit
exit
container ended!
[root@localhost ~]# ps -ef
Error, do this: mount -t proc proc /proc
[root@localhost ~]# cat a.c

Solution

  • This happens because mount events propagate between namespaces. The propagation type of your mount point is MS_SHARED.

    MS_SHARED: This mount point shares mount and unmount events with other mount points that are members of its "peer group". When a mount point is added or removed under this mount point, this change will propagate to the peer group, so that the mount or unmount will also take place under each of the peer mount points. Propagation also occurs in the reverse direction, so that mount and unmount events on a peer mount will also propagate to this mount point.

    Source - https://lwn.net/Articles/689856/

    The shared:N tag in /proc/self/mountinfo indicates that the mount is sharing propagation events with a peer group:

    $ sudo go run namespace-1.go
    [root@localhost]# mount --bind /home/andrii/test /mnt
    # The propagation type is MS_SHARED
    [root@localhost]# grep '/mnt' /proc/self/mountinfo
    264 175 254:0 /home/andrii/test /mnt rw,noatime shared:1 - ext4 /dev/mapper/cryptroot rw,data=ordered
    [root@localhost]# exit
    $ ls /mnt
    test_file
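
To check the propagation type from Go rather than with grep, the optional fields of a mountinfo line can be parsed directly. The following is a minimal sketch, assuming the field layout described in proc(5): the optional fields start at the seventh field and end at the "-" separator. The function name propagationOf is hypothetical and used only for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// propagationOf returns the optional propagation tags of a single
// /proc/self/mountinfo line (for example "shared:1"), or "private"
// when the line carries no optional fields.
func propagationOf(line string) string {
	fields := strings.Fields(line)
	var tags []string
	// Optional fields start at index 6 and end at the "-" separator.
	for i := 6; i < len(fields) && fields[i] != "-"; i++ {
		tags = append(tags, fields[i])
	}
	if len(tags) == 0 {
		return "private"
	}
	return strings.Join(tags, ",")
}

func main() {
	shared := "264 175 254:0 /home/andrii/test /mnt rw,noatime shared:1 - ext4 /dev/mapper/cryptroot rw,data=ordered"
	private := "264 175 254:0 /home/andrii/test /mnt rw,noatime - ext4 /dev/mapper/cryptroot rw,data=ordered"
	fmt.Println(propagationOf(shared))  // shared:1
	fmt.Println(propagationOf(private)) // private
}
```

In a real check you would read /proc/self/mountinfo line by line and pass each line to the function.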
    

    On most Linux distributions, the effective default propagation type is MS_SHARED, because systemd sets it at startup. See NOTES in man 7 mount_namespaces:

    Notwithstanding the fact that the default propagation type for new mount points is in many cases MS_PRIVATE, MS_SHARED is typically more useful. For this reason, systemd(1) automatically remounts all mount points as MS_SHARED on system startup. Thus, on most modern systems, the default propagation type is in practice MS_SHARED.

    If you want a fully isolated namespace, you can make all mount points private inside the new namespace before mounting anything:

    $ sudo go run namespace-1.go
    [root@localhost]# mount --make-rprivate /
    [root@localhost]# mount --bind /home/andrii/test /mnt
    # The propagation type is MS_PRIVATE now
    [root@localhost]# grep '/mnt' /proc/self/mountinfo
    264 175 254:0 /home/andrii/test /mnt rw,noatime - ext4 /dev/mapper/cryptroot rw,data=ordered
    [root@localhost]# exit
    $ ls /mnt