Backtrace of the coredump:
#0 0x0000000000416228 in add_to_epoll (struct_fd=0x18d32760, lno=7901) at lbi.c:7092
#1 0x0000000000418b54 in connect_fc (struct_fd=0x18d32760, type=2) at lbi.c:7901
#2 0x0000000000418660 in poll_fc (arg=0x0) at lbi.c:7686
#3 0x00000030926064a7 in start_thread () from /lib64/libpthread.so.0
#4 0x0000003091ed3c2d in clone () from /lib64/libc.so.6
Code Snippet:
#define unExp(x) __builtin_expect((x),0)
...
7087 int add_to_epoll( struct fdStruct * struct_fd, int lno)
7088 {
7089 struct epoll_event ev;
7090 ev.events = EPOLLIN | EPOLLET | EPOLLPRI | EPOLLERR ;
7091 ev.data.fd = fd_st->fd;
7092 if (unExp(epoll_ctl(struct_fd->Hdr->info->epollfd, EPOLL_CTL_ADD, struct_fd->fd,&ev) == -1))
7093 {
7094 perror("client FD ADD to epoll error:");
7095 return -1;
7096 }
7097 else
7098 {
...
7109 }
7110 return 1;
7111 }
Disassembly of the offending line. I am not good at interpreting assembly code but have tried my best:
if (unExp(epoll_ctl(struct_fd->Hdr->info->epollfd, EPOLL_CTL_ADD, stuct_fd->fd,&ev) == -1))
416210: 48 8b 45 d8 mov 0xffffffffffffffd8(%rbp),%rax // Storing struct_fd->fd
416214: 8b 10 mov (%rax),%edx // to EDX
416216: 48 8b 45 d8 mov 0xffffffffffffffd8(%rbp),%rax // Storing struct_fd->Hdr->info->epollfd
41621a: 48 8b 80 e8 01 00 00 mov 0x1e8(%rax),%rax // to EDI which failed
416221: 48 8b 80 58 01 00 00 mov 0x158(%rax),%rax // while trying to offset members of the structure
416228: 8b 78 5c mov 0x5c(%rax),%edi // <--- failed here since Reg AX is 0x0
41622b: 48 8d 4d e0 lea 0xffffffffffffffe0(%rbp),%rcx
41622f: be 01 00 00 00 mov $0x1,%esi
416234: e8 b7 e1 fe ff callq 4043f0 <epoll_ctl@plt>
416239: 83 f8 ff cmp $0xffffffffffffffff,%eax
41623c: 0f 94 c0 sete %al
41623f: 0f b6 c0 movzbl %al,%eax
416242: 48 85 c0 test %rax,%rax
416245: 74 5e je 4162a5 <add_to_epoll+0xc9>
Printing out Registers and struct member values:
(gdb) i r $rax
rax 0x0 0
(gdb) p struct_fd
$3 = (struct fdStruct *) 0x18d32760
(gdb) p struct_fd->Hdr
$4 = (StHdr *) 0x3b990f30
(gdb) p struct_fd->Hdr->info
$5 = (struct Info *) 0x3b95b410 // Strangely, this is NOT NULL. Inconsistent with assembly dump.
(gdb) p ev
$6 = {events = 2147483659, data = {ptr = 0x573dc648000003d6, fd = 982, u32 = 982, u64= 6286398667419026390}}
Please let me know if my dis-assembly interpretation is OK. And if yes, would like to understand why gdb not showing NULL when it is printing out the structure members.
OR if the analysis is not perfect would like to know the actual reason of coredump. Please let me know if you need more info.
---- The following part has been added Later ----
The proxy is a multithreaded program. Doing more digging came to know that when the problem occurs the following two thread were running in parallel. And when I avoid the two functions to run parallely the problem never occurs. But, the thing is I cannot explain how this behavior results into the original problematic scene:
Thread 1:
------------------------------------------------------------
int new_connection() {
...
struct_fd->Hdr->info=NULL; /* (line 1) */
...
<some code>
...
struct_fd->Hdr->info=Golbal_InFo_Ptr; /* (line 2) */ // This is a malloced memory, once allocated never freed
...
...
}
------------------------------------------------------------
Thread 2 executing add_to_epoll():
------------------------------------------------------------
int add_to_epoll( struct fdStruct * struct_fd, int lno)
{
...
if (unExp(epoll_ctl(struct_fd->Hdr->info->epollfd,...) /* (line 3) */
...
}
------------------------------------------------------------
In the above snippets if execution is done in the order, LIne 1, Line 3, Line 2, the scene can occur. What I expect is whenever an illegal reference is encountered it should dump immediately without trying to execute LINE 3 which makes it NON NULL. It is a definite behavior because till now I have got around 12 coredumps of the same problem, all showing the exact same thing.
It is clear that struct_fd->Hdr->info
is NULL
, as Per Johansson already answered.
However, GDB thinks that it is not. How could that be?
One common way this happens, is when
struct fdStruct
, struct StHdr
(or both),
andThe disassembly shows that offsetof(struct fdStruct, Hdr) == 0x1e8
and offsetof(struct StHdr, info) == 0x158
. See what GDB prints for the following:
(gdb) print/x (char*)&struct_fd->Hdr - (char*)struct_fd
(gdb) print/x (char*)&struct_fd->Hdr->info - (char*)struct_fd->Hdr
I bet it would print something other than 0x1e8
and 0x158
.
If that's the case, make clean && make
may fix the problem.
Update:
(gdb) print/x (char*)&struct_fd->Hdr - (char*)struct_fd
$1 = 0x1e8
(gdb) print/x (char*)&struct_fd->Hdr->info - (char*)struct_fd->Hdr
$3 = 0x158
This proves that GDB's idea of how objects are laid out in memory matches compiled code.
We still don't know whether GDB's idea of the value of struct_fd
matches reality. What do these commands print?
(gdb) print struct_fd
(gdb) x/gx $rbp-40
They should produce the same value (0x18d32760
). Assuming they do, the only other explanation I can think of is that you have multiple threads accessing struct_fd, and the other thread overwrites the value that used to be NULL with the new value.
I just noticed your update to the question ;-)
What I expect is whenever an illegal reference is encountered it should dump immediately without trying to execute LINE 3 which makes it NON NULL.
Your expectation is incorrect: on any modern CPU, you have multiple cores, and your threads are executing simultaneously. That is, you have this code (time goes down along Y axis):
char *p; // global
Time CPU0 CPU1
0 p = NULL
1 if (*p) p = malloc(1)
2 *p = 'a';
...
At T1, CPU0 traps into the OS, but CPU1 continues. Eventually, the OS processes hardware trap, and dumps memory state at that time. On CPU1, hundreds of instructions may have executed after T1. The clocks between CPU0 and CPU1 aren't even synchronized, they don't necessarily go in lock-step.
Moral of the story: don't access global variables from multiple threads without proper locking.