
Hung processes resume if attached to strace


I have a network program written in C using TCP sockets. Sometimes the client program hangs forever waiting for input from the server. Specifically, the client hangs in a select() call on an fd used to read characters sent by the server.

I am using strace to find out where the process is stuck. However, sometimes when I attach strace to the hung client process, it immediately resumes execution and exits properly. Not all hung processes behave this way: some stay stuck in select() even after I attach strace. But most of them resume execution when strace attaches.

I am curious what causes the processes to resume when strace attaches. It might give me a clue as to why the client processes hang.

Any ideas? What causes a hung process to resume its execution when strace attaches to it?

Update:

Here's the output of strace on hung processes.

> sudo strace -p 25645
Process 25645 attached - interrupt to quit
--- SIGSTOP (Stopped (signal)) @ 0 (0) ---
--- SIGSTOP (Stopped (signal)) @ 0 (0) ---
[ Process PID=25645 runs in 32 bit mode. ]
select(6, [3 5], NULL, NULL, NULL)      = 2 (in [3 5])
read(5, "\0", 8192)                     = 1
write(2, "", 0)                         = 0
read(3, "====Setup set_oldtempbehaio"..., 8192) = 555
write(1, "====Setup set_oldtempbehaio"..., 555) = 555
select(6, [3 5], NULL, NULL, NULL)      = 2 (in [3 5])
read(5, "", 8192)                       = 0
read(3, "", 8192)                       = 0
close(5)                                = 0
kill(25652, SIGKILL)                    = 0
exit_group(0)                           = ?
Process 25645 detached

_

> sudo strace -p 14462
Process 14462 attached - interrupt to quit
[ Process PID=14462 runs in 32 bit mode. ]
read(0, 0xff85fdbc, 8192)               = -1 EIO (Input/output error)
shutdown(3, 1 /* send */)               = 0
exit_group(0)                           = ?

_

> sudo strace -p 7517
Process 7517 attached - interrupt to quit
--- SIGSTOP (Stopped (signal)) @ 0 (0) ---
--- SIGSTOP (Stopped (signal)) @ 0 (0) ---
[ Process PID=7517 runs in 32 bit mode. ]
connect(3, {sa_family=AF_INET, sin_port=htons(300), sin_addr=inet_addr("100.64.220.98")}, 16) = -1 ETIMEDOUT (Connection timed out)
close(3)                                = 0
dup(2)                                  = 3
fcntl64(3, F_GETFL)                     = 0x1 (flags O_WRONLY)
close(3)                                = 0
write(2, "dsd13: Connection timed out\n", 30) = 30
write(2, "Error code : 110\n", 17)      = 17
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
exit_group(1)                           = ?
Process 7517 detached

It is not just select(): processes of the same program are stuck in various system calls before I attach strace to them. They suddenly resume after strace attaches. If I don't attach strace, they hang there forever.

Update 2:

I learned that strace can restart a process that was previously stopped (a process in the T state). Now I am trying to understand why these processes entered the T state in the first place. Here's the /proc/<pid>/status information:

> cat /proc/12554/status
Name:   someone
State:  T (stopped)
SleepAVG:       88%
Tgid:   12554
Pid:    12554
PPid:   9754
TracerPid:      0
Uid:    5000    5000    5000    5000
Gid:    48986   48986   48986   48986
FDSize: 256
Groups: 9149 48986
VmPeak:     1992 kB
VmSize:     1964 kB
VmLck:         0 kB
VmHWM:       608 kB
VmRSS:       608 kB
VmData:      156 kB
VmStk:        20 kB
VmExe:        16 kB
VmLib:      1744 kB
VmPTE:        20 kB
Threads:        1
SigQ:   54/73728
SigPnd: 0000000000000000
ShdPnd: 0000000000000000
SigBlk: 0000000000000000
SigIgn: 0000000000000006
SigCgt: 0000000000004000
CapInh: 0000000000000000
CapPrm: 0000000000000000
CapEff: 0000000000000000
Cpus_allowed:   00000000,00000000,00000000,0000000f
Mems_allowed:   00000000,00000001

Solution

  • strace uses ptrace. The ptrace man page has this:

    Since attaching sends SIGSTOP and the tracer usually suppresses it,
    this may cause a stray EINTR return from the currently executing system
    call in the tracee, as described in the "Signal injection and
    suppression" section.
    

    Are you seeing select return EINTR?