Why is this pointer arithmetic necessary to use the clone function in C?

I'm trying to use the clone() function in C, and am uncertain of how the second argument works. Per the clone() man page:

   The child_stack argument specifies the location of the stack used by the
   child process. Since the child and calling process may share memory, it
   is not possible for the child process to execute in the same stack as the
   calling process. The calling process must therefore set up memory space
   for the child stack and pass a pointer to this space to clone(). Stacks
   grow downwards on all processors that run Linux (except the HP PA
   processors), so child_stack usually points to the topmost address of the
   memory space set up for the child stack.

After following suggestions in the comments on this article, I've been able to get a simple example working using this C program:

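#define _GNU_SOURCE   /* needed for clone() and the CLONE_* flags in <sched.h> */
#include <stdio.h>
#include <sched.h>
#include <stdlib.h>
#include <assert.h>
#include <sys/wait.h> /* waitpid(), WEXITSTATUS() */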

#define SIZE 65536

int v1;

int run(void *arg) {
  v1 = 42;
  return 0;
}

int main(int argc, char **argv) {
  void **child_stack;
  int pid, rc, status;
  v1 = 10;
  child_stack = (void **) malloc(SIZE);
  assert(child_stack != NULL);
  printf("v1 before: %d\n", v1);

  pid = clone(run, child_stack + SIZE/sizeof(void **), CLONE_VM, NULL);
  //pid = clone(run, child_stack + SIZE, CLONE_VM, NULL);

  assert(pid != -1);
  status = 0;
  rc = waitpid(pid, &status, __WALL);
  assert(rc != -1);
  assert(WEXITSTATUS(status) == 0);
  printf("v1 after:  %d\n", v1);
  return 0;
}

But I'm confused as to why the particular pointer arithmetic in the clone line is necessary. Given that according to the clone docs the stack is supposed to grow downward, I see why you should add a value to the pointer returned by malloc before passing it in. But I'd expect that you'd add the number of bytes malloc'd, instead of that value divided by 8 (on a 64-bit system), which is what seems to actually work. The code above seems to work fine regardless of what I define SIZE as, but if I use the commented version instead, which is what I'd expect to work, I get a segmentation fault for all SIZE values above a certain threshold.

So, does anyone understand why the given clone line works, but the commented-out one doesn't?

As for why I'm using clone to begin with, instead of fork or pthreads, I'm trying to use some of its advanced sandboxing features to prevent an untrusted process from breaking out of a chroot jail, as described here.

Solution

Pointer arithmetic is scaled by the size of the type pointed to when the actual memory offset is computed. For example:

int a[2] = {1, 2};
int* p = a;

printf("%p: %p\n", (void *) &a[0], (void *) p);
printf("%p: %p\n", (void *) &a[1], (void *) (p + 1));

In this case, p + 1 isn't the address stored in p plus 1; it's that address plus 1 * sizeof(int) (the size of the type pointed to). To account for this, when you want to offset a pointer by some number of bytes, you need to divide the byte count by the size of the type it points to. In your case, child_stack is a void **, so the type pointed to is void *, and it would be more accurate to write:

pid = clone(run, child_stack + SIZE/sizeof(void *), CLONE_VM, NULL);

You can visualize this behavior with something like:

int SIZE = 65536;
void** child_stack = (void **) malloc(SIZE);

void** child_stack_end = child_stack + SIZE;
void** child_stack_end2 = child_stack + SIZE / sizeof(*child_stack);

printf("%ld\n", (long) ((intptr_t)child_stack_end - (intptr_t)child_stack)); // "524288" with 8-byte pointers ("262144" with 4-byte pointers)
printf("%ld\n", (long) ((intptr_t)child_stack_end2 - (intptr_t)child_stack)); // "65536"