Tags: c, multithreading, assembly, gdb, coredump

Process dumped core in a multithreaded program, but it does not look like an illegal reference


Backtrace of the coredump:

#0  0x0000000000416228 in add_to_epoll (struct_fd=0x18d32760, lno=7901) at lbi.c:7092
#1  0x0000000000418b54 in connect_fc (struct_fd=0x18d32760, type=2) at lbi.c:7901
#2  0x0000000000418660 in poll_fc (arg=0x0) at lbi.c:7686
#3  0x00000030926064a7 in start_thread () from /lib64/libpthread.so.0
#4  0x0000003091ed3c2d in clone () from /lib64/libc.so.6

Code Snippet:

#define unExp(x) __builtin_expect((x),0)
... 
7087 int add_to_epoll( struct fdStruct * struct_fd, int lno)
7088 {
7089    struct epoll_event ev;
7090    ev.events = EPOLLIN | EPOLLET | EPOLLPRI | EPOLLERR ;
7091    ev.data.fd = struct_fd->fd;
7092    if (unExp(epoll_ctl(struct_fd->Hdr->info->epollfd, EPOLL_CTL_ADD, struct_fd->fd, &ev) == -1))
7093    {
7094        perror("client FD  ADD to epoll error:");
7095        return -1;
7096    }
7097    else
7098    {
            ...
7109    }
7110    return 1;
7111 }

Disassembly of the offending line. I am not good at interpreting assembly code but have tried my best:

        if (unExp(epoll_ctl(struct_fd->Hdr->info->epollfd, EPOLL_CTL_ADD, struct_fd->fd,&ev) == -1))
  416210:       48 8b 45 d8             mov    0xffffffffffffffd8(%rbp),%rax // RAX = struct_fd (loaded from the stack)
  416214:       8b 10                   mov    (%rax),%edx                   // EDX = struct_fd->fd (offset 0)
  416216:       48 8b 45 d8             mov    0xffffffffffffffd8(%rbp),%rax // RAX = struct_fd again
  41621a:       48 8b 80 e8 01 00 00    mov    0x1e8(%rax),%rax              // RAX = struct_fd->Hdr
  416221:       48 8b 80 58 01 00 00    mov    0x158(%rax),%rax              // RAX = struct_fd->Hdr->info
  416228:       8b 78 5c                mov    0x5c(%rax),%edi               // EDI = info->epollfd <--- failed here since RAX is 0x0
  41622b:       48 8d 4d e0             lea    0xffffffffffffffe0(%rbp),%rcx
  41622f:       be 01 00 00 00          mov    $0x1,%esi
  416234:       e8 b7 e1 fe ff          callq  4043f0 <epoll_ctl@plt>
  416239:       83 f8 ff                cmp    $0xffffffffffffffff,%eax
  41623c:       0f 94 c0                sete   %al
  41623f:       0f b6 c0                movzbl %al,%eax
  416242:       48 85 c0                test   %rax,%rax
  416245:       74 5e                   je     4162a5 <add_to_epoll+0xc9>

Printing out Registers and struct member values:

(gdb) i r $rax
rax            0x0      0
(gdb) p struct_fd
$3 = (struct fdStruct *) 0x18d32760
(gdb) p struct_fd->Hdr
$4 = (StHdr *) 0x3b990f30
(gdb) p struct_fd->Hdr->info
$5 = (struct Info *) 0x3b95b410    // Strangely, this is NOT NULL. Inconsistent with assembly dump.
(gdb) p ev
$6 = {events = 2147483659, data = {ptr = 0x573dc648000003d6, fd = 982, u32 = 982, u64= 6286398667419026390}}

Please let me know whether my interpretation of the disassembly is correct. If it is, I would like to understand why gdb does not show NULL when it prints the structure members.

Or, if the analysis is wrong, I would like to know the actual reason for the coredump. Please let me know if you need more info.

  • Thanks

---- The following part has been added Later ----

The proxy is a multithreaded program. After more digging, I found that when the problem occurs the following two threads were running in parallel, and when I prevent the two functions from running in parallel the problem never occurs. However, I cannot explain how this behavior leads to the original problem:

Thread 1: 
------------------------------------------------------------
int new_connection() {
   ...
   struct_fd->Hdr->info=NULL; /* (line 1)  */
   ...
   <some code>
   ...
   struct_fd->Hdr->info=Golbal_InFo_Ptr; /* (line 2) */  // This is a malloced memory, once allocated never freed
   ...
   ...
}
------------------------------------------------------------

Thread 2 executing add_to_epoll():
------------------------------------------------------------
int add_to_epoll( struct fdStruct * struct_fd, int lno)
{
   ...
   if (unExp(epoll_ctl(struct_fd->Hdr->info->epollfd,...)  /* (line 3) */
   ...
}
------------------------------------------------------------

In the above snippets, if execution happens in the order Line 1, Line 3, Line 2, the scenario can occur. What I expect is that whenever an illegal reference is encountered, the process should dump immediately, without going on to execute Line 2, which makes the pointer non-NULL. The behavior is consistent: so far I have collected around 12 coredumps of this problem, all showing exactly the same thing.


Solution

  • It is clear that struct_fd->Hdr->info is NULL, as Per Johansson already answered.

    However, GDB thinks that it is not. How could that be?

    One common way this happens is when

    1. you change the layout of struct fdStruct, struct StHdr (or both), and
    2. you neglect to rebuild all objects that use these definitions

    The disassembly shows that offsetof(struct fdStruct, Hdr) == 0x1e8 and offsetof(struct StHdr, info) == 0x158. See what GDB prints for the following:

    (gdb) print/x (char*)&struct_fd->Hdr - (char*)struct_fd
    (gdb) print/x (char*)&struct_fd->Hdr->info - (char*)struct_fd->Hdr
    

    I bet it would print something other than 0x1e8 and 0x158.

    If that's the case, make clean && make may fix the problem.
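The offsets read off the disassembly can also be checked from C with offsetof. The sketch below is only an illustration: the three offsets (0x1e8, 0x158, 0x5c) come from the disassembly, but the struct bodies and every `pad` member are invented stand-ins, since the real definitions in lbi.c are not shown. The hypothetical helper `load_epollfd` mirrors the three dependent loads of the faulting line.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical layouts. Only the offsets of fd, Hdr, info and epollfd
 * are taken from the disassembly; the padding is made up to reproduce them. */
struct Info     { char pad[0x5c];  int epollfd; };
struct StHdr    { char pad[0x158]; struct Info *info; };
struct fdStruct { int fd; char pad[0x1e8 - sizeof(int)]; struct StHdr *Hdr; };

/* Compile-time checks: these are the displacements seen in the mov instructions. */
_Static_assert(offsetof(struct fdStruct, Hdr)  == 0x1e8, "Hdr offset");
_Static_assert(offsetof(struct StHdr, info)    == 0x158, "info offset");
_Static_assert(offsetof(struct Info, epollfd)  == 0x5c,  "epollfd offset");

/* Each statement corresponds to one load in the disassembly. */
int load_epollfd(const struct fdStruct *struct_fd)
{
    const struct StHdr *hdr  = struct_fd->Hdr;  /* mov 0x1e8(%rax),%rax */
    const struct Info  *info = hdr->info;       /* mov 0x158(%rax),%rax */
    return info->epollfd;                       /* mov 0x5c(%rax),%edi, faults if info is NULL */
}
```

If an object file had been built against a stale header, the same offsetof expressions compiled in that file would disagree with these values, which is the mismatch the gdb commands above look for.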

    Update:

    (gdb) print/x (char*)&struct_fd->Hdr - (char*)struct_fd
    $1 = 0x1e8
    (gdb) print/x (char*)&struct_fd->Hdr->info - (char*)struct_fd->Hdr
    $3 = 0x158
    

    This proves that GDB's idea of how objects are laid out in memory matches compiled code.

    We still don't know whether GDB's idea of the value of struct_fd matches reality. What do these commands print?

    (gdb) print struct_fd
    (gdb) x/gx $rbp-40
    

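    They should produce the same value (0x18d32760): $rbp-40 is the stack slot written as 0xffffffffffffffd8(%rbp) in the disassembly, which is where the compiled code loads struct_fd from. Assuming they do, the only other explanation I can think of is that you have multiple threads accessing struct_fd, and another thread overwrites struct_fd->Hdr->info, which was NULL at the moment of the fault, with the new value.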

    I just noticed your update to the question ;-)

    What I expect is whenever an illegal reference is encountered it should dump immediately without trying to execute LINE 2 which makes it NON NULL.

    Your expectation is incorrect: on any modern CPU, you have multiple cores, and your threads are executing simultaneously. That is, you have this code (time goes down along Y axis):

    char *p;  // global
    
    
    Time     CPU0                  CPU1
    0        p = NULL
    1        if (*p)               p = malloc(1)
    2                              *p = 'a';
    ...
    

    At T1, CPU0 traps into the OS, but CPU1 continues. Eventually the OS processes the hardware trap and dumps the memory state as it is at that time. By then, CPU1 may have executed hundreds of instructions past T1. The clocks of CPU0 and CPU1 aren't even synchronized; the cores don't necessarily run in lock-step.

    Moral of the story: don't access global variables from multiple threads without proper locking.
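One way to apply this to the code in the question is sketched below. It is a minimal illustration under stated assumptions, not the original program: the struct bodies, the `lock` member, and the helpers `hdr_init`, `publish_info` and `read_epollfd` are all hypothetical. The writer resets and re-publishes the pointer while holding a mutex, and the reader takes the same mutex and checks for NULL before dereferencing.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-ins for the question's structures. */
struct Info  { int epollfd; };
struct StHdr { struct Info *info; pthread_mutex_t lock; };

/* Hypothetical helper: start with no info published. */
void hdr_init(struct StHdr *hdr)
{
    hdr->info = NULL;
    pthread_mutex_init(&hdr->lock, NULL);
}

/* Writer side (new_connection in the question): the temporary NULL is
 * only visible while the lock is held, so a reader cannot observe it. */
void publish_info(struct StHdr *hdr, struct Info *new_info)
{
    pthread_mutex_lock(&hdr->lock);
    hdr->info = NULL;       /* (line 1) in the question */
    /* ... other setup ... */
    hdr->info = new_info;   /* (line 2) in the question */
    pthread_mutex_unlock(&hdr->lock);
}

/* Reader side (add_to_epoll in the question): returns 0 and stores the
 * epoll descriptor in *out, or returns -1 if nothing is published yet. */
int read_epollfd(struct StHdr *hdr, int *out)
{
    int rc = -1;
    pthread_mutex_lock(&hdr->lock);
    if (hdr->info != NULL) {
        *out = hdr->info->epollfd;
        rc = 0;
    }
    pthread_mutex_unlock(&hdr->lock);
    return rc;
}
```

In this sketch, add_to_epoll would call read_epollfd first and pass the returned descriptor to epoll_ctl, rather than following struct_fd->Hdr->info itself. Whether one lock per StHdr is the right granularity depends on how the real program shares these objects.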