Memory accounting with cgroups: cgexec vs. writing the PID to cgroup.procs

I ran into an interesting situation yesterday with the cgroups memory controller. I had always assumed that the memory reported by cgroups was a process's total memory consumption, but that does not seem to be the case.

I wrote the following Java program for testing:

import java.util.Scanner;

class TestApp {

  public static void main(String[] args) {

    int[] arr;

    Scanner in = new Scanner(System.in);
    System.out.println("Press enter to allocate memory");
    in.nextLine();

    // 1024 * 1024 ints at 4 bytes each = 4 MiB
    arr = new int[1024*1024];
    System.out.println("Allocated memory");
    // Spin forever so the process stays alive while the cgroup is inspected
    while(true);
  }

}

When I run the above with cgexec, the reported memory usage is vastly different from when I echo the JVM's PID into the cgroup.procs file of the cgroup. It seems that cgroups only report the memory a process uses after it has been placed inside the cgroup.

How do cgroups account for memory? When I use cgexec, the JVM's own consumption is accounted for. On the other hand, when I start the JVM outside of the cgroup and move it in later by writing its PID into the cgroup.procs file, the memory consumption reported in memory.usage_in_bytes remains zero until I hit Enter, and then it goes up to 1024 * 1024 * 4 bytes (4 MiB), as expected.
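The same two steps (moving the PID, then reading the counter) can be done from C. This is a minimal sketch under stated assumptions: the cgroup v1 memory controller is mounted at /sys/fs/cgroup/memory, a cgroup named test (a placeholder) already exists, and the caller is allowed to write to it. The helper names move_pid_to_cgroup() and read_usage_bytes() are hypothetical; they just mirror `echo PID > cgroup.procs` and `cat memory.usage_in_bytes`.

```c
#include <stdio.h>
#include <sys/types.h>

/* Hypothetical helper: equivalent of `echo PID > <cgroup>/cgroup.procs`.
 * Writing a PID to cgroup.procs moves the whole process (all threads).
 * Returns 0 on success, -1 on failure. */
int move_pid_to_cgroup(const char *procs_path, pid_t pid)
{
    FILE *f = fopen(procs_path, "w");
    if (f == NULL)
        return -1;
    int ok = fprintf(f, "%ld\n", (long)pid) > 0;
    /* stdio buffers the data, so the real write() (and any error the
     * kernel reports for the migration) usually happens in fclose(). */
    if (fclose(f) != 0)
        ok = 0;
    return ok ? 0 : -1;
}

/* Hypothetical helper: equivalent of `cat <cgroup>/memory.usage_in_bytes`.
 * Stores the value in *out; returns 0 on success, -1 on failure. */
int read_usage_bytes(const char *usage_path, unsigned long long *out)
{
    FILE *f = fopen(usage_path, "r");
    if (f == NULL)
        return -1;
    int n = fscanf(f, "%llu", out);
    fclose(f);
    return n == 1 ? 0 : -1;
}
```

With the placeholder cgroup, one would call move_pid_to_cgroup("/sys/fs/cgroup/memory/test/cgroup.procs", pid) and then read_usage_bytes("/sys/fs/cgroup/memory/test/memory.usage_in_bytes", &usage).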

Furthermore, the memory consumption reported by cgroups is not entirely the same as the memory consumption reported by top, for example.
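To put the two numbers side by side, the per-process figure can be read from /proc/<pid>/status. top's RES column roughly corresponds to the VmRSS line there, i.e. the process's resident set, which counts every resident page the process maps (including shared ones) regardless of which cgroup was charged for it. memory.usage_in_bytes instead counts pages charged to the cgroup, which can include page cache. The helper name read_vmrss_kb() is hypothetical; this is a sketch, not a definitive comparison tool.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: return the VmRSS value (in kB) from a
 * /proc/<pid>/status file, or -1 if the file or the line is missing. */
long read_vmrss_kb(const char *status_path)
{
    FILE *f = fopen(status_path, "r");
    if (f == NULL)
        return -1;
    char line[256];
    long kb = -1;
    while (fgets(line, sizeof line, f) != NULL) {
        /* The line looks like "VmRSS:\t    1234 kB". */
        if (strncmp(line, "VmRSS:", 6) == 0) {
            if (sscanf(line + 6, "%ld", &kb) != 1)
                kb = -1;
            break;
        }
    }
    fclose(f);
    return kb;
}
```

Kernel threads have no VmRSS line, which is why the function falls back to -1.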

Edit: I wrote the following C program and used it for testing, and I see the same results. With cgclassify, memory usage stays at 0 until I hit Enter; with cgexec, it is already above 0 before I hit Enter.

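#include <stdio.h>
#include <stdlib.h>
#include <string.h>  /* memset */

#define SIZE (1024 * 1024)  /* 1 MiB */

int main(void) {

  printf("Press ENTER to consume memory\n");
  getchar();

  char *ptr = malloc(SIZE);
  if (ptr == NULL) {
    fprintf(stderr, "Out of memory\n");
    exit(1);
  }

  /* Write to every byte so the pages are actually backed by physical memory */
  memset(ptr, 0, SIZE);

  printf("Press ENTER to quit\n");
  getchar();

  free(ptr);
  return 0;
}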

Solution

When a process allocates memory and a page is actually faulted in (touched for the first time), the kernel charges that page to the process's current memory cgroup: the page is tagged with an identifier telling the kernel which memory controller cgroup it belongs to (the charge also counts toward every ancestor of that cgroup).
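Because the charge happens at fault time rather than at malloc() time, what matters is when pages are first written. A minimal sketch, assuming the page size reported by sysconf(_SC_PAGESIZE) (with a 4 KiB fallback that is purely an assumption): the hypothetical touch_pages() writes one byte into every page a buffer spans, which is enough to fault each page in and get it charged to the caller's current cgroup.

```c
#include <stddef.h>
#include <stdint.h>
#include <unistd.h>

/* Hypothetical helper: write one byte into every page that buf[0..len)
 * spans, so each page is faulted in (and, under the memory controller,
 * charged to the calling process's current cgroup).
 * Returns the number of distinct pages touched. */
size_t touch_pages(char *buf, size_t len)
{
    if (len == 0)
        return 0;
    long ps = sysconf(_SC_PAGESIZE);
    size_t page = ps > 0 ? (size_t)ps : 4096; /* assumed fallback: 4 KiB */
    uintptr_t start = (uintptr_t)buf;
    uintptr_t end = start + len;              /* one past the last byte */
    uintptr_t first = start - start % page;   /* start of buf's first page */
    size_t count = 0;
    for (uintptr_t p = first; p < end; p += page) {
        /* Touch the first byte of this page that belongs to buf. */
        char *q = p < start ? buf : buf + (p - start);
        *q = 1;
        count++;
    }
    return count;
}
```

This is consistent with the C test program above: the cgroup counter is expected to move at the memset(), not at the malloc().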

When you migrate a process to a new cgroup, memory that has already been charged does not, by default, change its tag. Retagging everything would be very expensive, and it would not even be well-defined: suppose a page is shared by two processes and you migrate only one of them to a different cgroup. What should the "new" tag be, when the page is now used by processes in two different cgroups? (cgroup v1 does offer an opt-in memory.move_charge_at_immigrate setting to move charges along with a task; cgroup v2 has no equivalent.)

So if you are sitting in the root memory cgroup, /sys/fs/cgroup/memory (i.e. your PID is listed in /sys/fs/cgroup/memory/cgroup.procs and not in the cgroup.procs file of any of its child cgroups), anything you allocate and touch is charged to that cgroup and that cgroup only.
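To check which memory cgroup a process is actually in, one can read /proc/<pid>/cgroup, where each line has the form hierarchy-ID:controller-list:cgroup-path. A minimal sketch, assuming cgroup v1, where the memory controller appears by name in the controller list (a pure cgroup v2 system has a single 0::/path line instead, which this helper does not handle). The name memory_cgroup_of() is hypothetical. A result of "/" means the process is in the root memory cgroup.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: find the line whose controller list contains
 * "memory" in a /proc/<pid>/cgroup file (v1 format "ID:controllers:path")
 * and copy its path into out. Returns 0 on success, -1 otherwise. */
int memory_cgroup_of(const char *proc_cgroup_path, char *out, size_t outlen)
{
    FILE *f = fopen(proc_cgroup_path, "r");
    if (f == NULL)
        return -1;
    char line[4096];
    int rc = -1;
    while (rc != 0 && fgets(line, sizeof line, f) != NULL) {
        char *c1 = strchr(line, ':');                /* end of hierarchy ID */
        char *c2 = c1 ? strchr(c1 + 1, ':') : NULL;  /* end of controllers */
        if (c2 == NULL)
            continue;
        *c2 = '\0';
        /* Walk the comma-separated controller list, e.g. "cpu,cpuacct". */
        for (char *tok = strtok(c1 + 1, ","); tok != NULL; tok = strtok(NULL, ",")) {
            if (strcmp(tok, "memory") == 0) {
                char *path = c2 + 1;
                path[strcspn(path, "\n")] = '\0';    /* drop the newline */
                if (strlen(path) < outlen) {
                    strcpy(out, path);
                    rc = 0;
                }
                break;
            }
        }
    }
    fclose(f);
    return rc;
}
```

For example, calling it on /proc/self/cgroup from a shell started with cgexec -g memory:test would be expected to yield /test.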

When you migrate to a different cgroup (or a child cgroup), only pages that are faulted in after the migration are charged to the new cgroup.
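The experiment from the question can be reduced to a before/after measurement. A minimal sketch, under the same assumptions as before (cgroup v1, a placeholder cgroup path, sufficient privileges): once the process has been moved, read memory.usage_in_bytes, allocate and touch a buffer, and read the counter again. Only the newly touched pages should show up in the difference. The name usage_delta_after_touch() is hypothetical, and the delta will not exactly equal the buffer size, since other activity in the cgroup (such as page cache) is charged too and the v1 counter is only approximate.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Read one unsigned integer from a cgroup file such as
 * memory.usage_in_bytes. Returns 0 on success, -1 on failure. */
static int read_counter(const char *path, unsigned long long *out)
{
    FILE *f = fopen(path, "r");
    if (f == NULL)
        return -1;
    int n = fscanf(f, "%llu", out);
    fclose(f);
    return n == 1 ? 0 : -1;
}

/* Hypothetical helper: measure how much the counter at usage_path grows
 * while this process allocates and touches `bytes` bytes. On success,
 * stores the difference in *delta, hands the buffer to the caller via
 * *buf_out (caller frees it) and returns 0; returns -1 on error. */
int usage_delta_after_touch(const char *usage_path, size_t bytes,
                            long long *delta, char **buf_out)
{
    unsigned long long before, after;
    *buf_out = NULL;
    if (read_counter(usage_path, &before) != 0)
        return -1;
    char *buf = malloc(bytes);
    if (buf == NULL)
        return -1;
    memset(buf, 0, bytes);  /* faults the pages in, so they get charged */
    if (read_counter(usage_path, &after) != 0) {
        free(buf);
        return -1;
    }
    *delta = (long long)after - (long long)before;
    *buf_out = buf;  /* keep the memory alive until the caller is done */
    return 0;
}
```

Returning the buffer instead of freeing it keeps the compiler from discarding the memset() as a dead store. With bytes set to 1024 * 1024, a process that was moved into the cgroup beforehand would be expected to see a delta of roughly 1 MiB.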

cgexec starts the JVM inside the target cgroup, so anything the JVM allocates and touches at initialisation time is already charged to the cgroup you created for it.
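Conceptually, cgexec joins the target cgroup first and only then executes the program, so the program's very first page faults are already charged there. The following is a minimal sketch of that idea, not libcgroup's actual implementation: the hypothetical exec_in_cgroup() writes the calling process's own PID into the cgroup's cgroup.procs file and then calls execvp(), and the hypothetical spawn_in_cgroup() wraps it in fork()/waitpid() so it can be driven from another program. The cgroup path is again a placeholder.

```c
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: join the cgroup (like `echo $$ > cgroup.procs`),
 * then replace this process with argv[0]. Only returns on failure. */
int exec_in_cgroup(const char *procs_path, char *const argv[])
{
    FILE *f = fopen(procs_path, "w");
    if (f == NULL)
        return -1;
    fprintf(f, "%ld\n", (long)getpid());
    if (fclose(f) != 0)
        return -1;
    /* From here on, every page the new program touches is charged
     * to the cgroup we just joined. */
    execvp(argv[0], argv);
    return -1; /* execvp only returns on error */
}

/* Hypothetical helper: run argv in a child that first joins the cgroup.
 * Returns the child's exit status, or -1 on error. */
int spawn_in_cgroup(const char *procs_path, char *const argv[])
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        exec_in_cgroup(procs_path, argv);
        _exit(127); /* reached only if joining or exec failed */
    }
    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Calling spawn_in_cgroup("/sys/fs/cgroup/memory/test/cgroup.procs", argv) with argv = {"java", "TestApp", NULL} would behave much like cgexec -g memory:test java TestApp.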

If you start a JVM in the root cgroup for the memory controller, then anything allocated and touched when initialising the JVM will belong to the root cgroup.

Once you migrate the JVM to its own private cgroup (with either mechanism, cgclassify or writing its PID to cgroup.procs) and it then allocates and touches some pages, those pages are charged to the new cgroup.