Search code examples
cclonedeadlockrealloc

Why realloc deadlock after clone syscall?


I have a problem that realloc() deadlocks sometime after clone() syscall.

My code is:

#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <linux/types.h>
#define CHILD_STACK_SIZE 4096*4
#define gettid() syscall(SYS_gettid)
#define log(str) fprintf(stderr, "[pid:%d tid:%d] "str, getpid(),gettid())

int clone_func(void *arg){
    int *ptr=(int*)malloc(10);
    int i;
    for (i=1; i<200000; i++)
        ptr = realloc(ptr, sizeof(int)*i);
    free(ptr);
    return 0;
}

int main(){
    int flags = 0;
    flags = CLONE_VM;
    log("Program started.\n");
    int *ptr=NULL;
    ptr = malloc(16);
    void *child_stack_start = malloc(CHILD_STACK_SIZE);
    int ret = clone(clone_func, child_stack_start +CHILD_STACK_SIZE, flags, NULL, NULL, NULL, NULL);
    int i;
    for (i=1; i<200000; i++)
        ptr = realloc(ptr, sizeof(int)*i);

    free(ptr);
    return 0;
}

the callstack in gdb is:

[pid:13268 tid:13268] Program started.
^Z[New LWP 13269]

Program received signal SIGTSTP, Stopped (user).
0x000000000040ba0e in __lll_lock_wait_private ()
(gdb) bt
#0  0x000000000040ba0e in __lll_lock_wait_private ()
#1  0x0000000000408630 in _L_lock_11249 ()
#2  0x000000000040797f in realloc ()
#3  0x0000000000400515 in main () at test-realloc.c:36
(gdb) i thr
  2 LWP 13269  0x000000000040ba0e in __lll_lock_wait_private ()
* 1 LWP 13268  0x000000000040ba0e in __lll_lock_wait_private ()
(gdb) thr 2
[Switching to thread 2 (LWP 13269)]#0  0x000000000040ba0e in __lll_lock_wait_private ()
(gdb) bt
#0  0x000000000040ba0e in __lll_lock_wait_private ()
#1  0x0000000000408630 in _L_lock_11249 ()
#2  0x000000000040797f in realloc ()
#3  0x0000000000400413 in clone_func (arg=0x7fffffffe53c) at test-realloc.c:20
#4  0x000000000040b889 in clone ()
#5  0x0000000000000000 in ?? ()

My OS is debian linux-2.6.32-5-amd64, with GNU C Library (Debian EGLIBC 2.11.3-4) stable release version 2.11.3. I deeply suspect that eglibc is the criminal on this bug. On clone() syscall, is it not enough before using realloc()?


Solution

  • You cannot use clone with CLONE_VM yourself -- or if you do, you have to at least make sure you restrict yourself from invoking any function from the standard library after calling clone in either the parent or the child. In order for multiple threads or processes to share the same memory, the implementations of any functions which access shared resources (like the heap) need to

    1. be aware of the fact that multiple flows of control are potentially accessing it so they can arrange to perform the appropriate synchronization, and
    2. be able to obtain information about their own identities via the thread pointer, usually stored in a special machine register. This is completely implementation-internal, and thus you cannot arrange for a new "thread" which you create yourself via clone to have a properly setup thread pointer.

    The proper solution is to use pthread_create, not clone.