
CSAPP tiny shell lab: stuck at sigprocmask


I am working on the tiny shell lab from CSAPP, but my shell gets stuck when I enter a command line.

steven@Steven:/mnt/f/大学/CSAPP/cmu15213/shlab-handout$ ./tsh
tsh> 123
tsh> 123: command not found
123
123
123
^\Terminating after receipt of SIGQUIT signal
steven@Steven:/mnt/f/大学/CSAPP/cmu15213/shlab-handout$ 

Link to the lab:

I modified the Makefile by adding -Og -g and tried to debug with GDB and VSCode.

I found that the program gets stuck on sigprocmask. As shown in the following picture, if I click "Step Over", it continues to run and never stops.
[Screenshot: the debugger paused at the sigprocmask call]

I copied the relevant piece of code and ran it separately, and it works correctly.

I have tested this both in WSL and a virtual machine, and both exhibited the same behavior.


Solution

  • if I click "Step Over", it continues to run and never stops.

    I reproduced that. So let's see what's going on.

    gdb -q ./tsh
    (gdb) b tsh.c:191
    Breakpoint 1 at 0x40156a: file tsh.c, line 191.
    
    (gdb) run
    Starting program: /tmp/tsh
    [Thread debugging using libthread_db enabled]
    Using host libthread_db library "/lib64/libthread_db.so.1".
    tsh> aaa
    [Detaching after fork from child process 182]
    
    aaa: command not found
    Breakpoint 1, eval (cmdline=0x7fffffffd930 "aaa\n") at tsh.c:191
    191         sigprocmask(SIG_SETMASK, &prev, NULL);
    (gdb) n
    

    At this point everything hangs, so we must be stuck on sigprocmask, right?

    Actually, we are not.

    ^C
    Program received signal SIGINT, Interrupt.
    0x00007ffff7eb5c37 in __GI___wait4 (pid=-1, stat_loc=0x7fffffffcd5c, options=3, usage=0x0)
        at ../sysdeps/unix/sysv/linux/wait4.c:30
    30        return SYSCALL_CANCEL (wait4, pid, stat_loc, options, usage);
    (gdb) bt
    #0  0x00007ffff7eb5c37 in __GI___wait4 (pid=-1, stat_loc=0x7fffffffcd5c, options=3, usage=0x0)
        at ../sysdeps/unix/sysv/linux/wait4.c:30
    #1  0x0000000000401a02 in sigchld_handler (sig=17) at tsh.c:383
    #2  <signal handler called>
    #3  __GI___pthread_sigmask (how=2, newmask=<optimized out>, oldmask=0x0) at pthread_sigmask.c:43
    #4  0x00007ffff7e18d8d in __GI___sigprocmask (how=<optimized out>, set=<optimized out>, oset=<optimized out>)
        at ../sysdeps/unix/sysv/linux/sigprocmask.c:25
    #5  0x0000000000401583 in eval (cmdline=0x7fffffffd930 "aaa\n") at tsh.c:191
    #6  0x000000000040142d in main (argc=1, argv=0x7fffffffde68) at tsh.c:149
    

    Now we see what's actually going on. The sigprocmask call unblocks SIGCHLD, so the pending SIGCHLD is delivered immediately, just before sigprocmask returns. That in turn invokes sigchld_handler, which calls waitpid over and over in a never-ending loop.
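The delivery order described above can be reproduced outside of tsh. The following is a minimal sketch, not the lab's code: the names `on_sigchld`, `handler_ran`, and `handler_runs_before_unblock_returns` are made up for illustration, and it uses SIG_UNBLOCK where tsh restores a saved mask with SIG_SETMASK. It assumes a single-threaded process on a POSIX system.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Set by the handler; sig_atomic_t is safe to write from a signal handler. */
static volatile sig_atomic_t handler_ran = 0;

/* Hypothetical handler: it only records that it was invoked. */
static void on_sigchld(int sig)
{
    (void)sig;
    handler_ran = 1;
}

/*
 * Returns 1 if the SIGCHLD handler had already run by the time the
 * unblocking sigprocmask call returned, 0 if it had not, and -1 if
 * the setup failed.
 */
int handler_runs_before_unblock_returns(void)
{
    struct sigaction sa;
    sigset_t mask;
    siginfo_t info;
    pid_t pid;

    handler_ran = 0;
    sa.sa_handler = on_sigchld;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;
    if (sigaction(SIGCHLD, &sa, NULL) < 0)
        return -1;

    /* Block SIGCHLD, as a shell does around fork(). */
    sigemptyset(&mask);
    sigaddset(&mask, SIGCHLD);
    if (sigprocmask(SIG_BLOCK, &mask, NULL) < 0)
        return -1;

    pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0)
        _exit(0);   /* the child exits at once, like a failed execve */

    /* Wait until the child has exited, but leave it unreaped (WNOWAIT). */
    if (waitid(P_PID, (id_t)pid, &info, WEXITED | WNOWAIT) < 0)
        return -1;

    /* SIGCHLD is pending at this point; unblocking it delivers it. */
    sigprocmask(SIG_UNBLOCK, &mask, NULL);
    int ran = handler_ran;   /* first statement after sigprocmask returns */

    waitpid(pid, NULL, 0);   /* reap the child */
    return ran;
}
```

Calling this function from `main` should yield 1: the handler runs inside the sigprocmask call, which is why a debugger stepping over that line ends up wherever the handler goes.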

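    Why doesn't the loop terminate? Because the code expects waitpid to return 0 when there are no children, but that is not correct: waitpid returns -1 (with errno set to ECHILD) in that case. It returns 0 only when children exist but none of them has changed state yet.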

    The following fix makes tsh work as one might expect:

     diff -u tsh.c.orig tsh.c
    --- tsh.c.orig  2024-01-20 21:42:47.915401415 -0800
    +++ tsh.c       2024-01-20 21:43:20.145996657 -0800
    @@ -383,7 +383,7 @@
             pid = waitpid(-1, &status, WNOHANG | WUNTRACED);
    
             // If there are no zombie processes, return
    -        if (pid == 0)
    +        if (pid == 0 || pid == -1)
                 return;
    
             // If waitpid returned because a child process terminated
    
    ./tsh
    tsh> aaa
    aaa: command not found
    tsh> bbb
    bbb: command not found
    tsh> 
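As an alternative to testing for 0 and -1 separately, a SIGCHLD handler is often written to loop only while waitpid returns a positive pid. The sketch below shows that shape under some assumptions: `sigchld_handler_sketch`, `reaped_count`, and `reap_demo` are hypothetical names, and the counter stands in for tsh's job-list updates, which are not shown in the question.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical counter standing in for tsh's job-list bookkeeping. */
static volatile sig_atomic_t reaped_count = 0;

/* Reap every child whose status is available, then return. */
void sigchld_handler_sketch(int sig)
{
    int saved_errno = errno;   /* waitpid may overwrite errno */
    int status;
    pid_t pid;

    (void)sig;
    /*
     * pid > 0 : a child changed state, so handle it and try again.
     * pid == 0: children exist, but none has status available.
     * pid < 0 : no children are left (ECHILD) or another error occurred.
     * The loop stops in both of the last two cases.
     */
    while ((pid = waitpid(-1, &status, WNOHANG | WUNTRACED)) > 0) {
        if (WIFEXITED(status) || WIFSIGNALED(status))
            reaped_count++;    /* a real shell would delete the job here */
        /* WIFSTOPPED(status): a real shell would mark the job as stopped */
    }
    errno = saved_errno;
}

/*
 * Demo: let three children exit, then call the handler once.
 * Returns the number of children reaped, or -1 if the setup failed.
 */
int reap_demo(void)
{
    enum { NCHILD = 3 };
    pid_t pids[NCHILD];
    siginfo_t info;

    signal(SIGCHLD, SIG_DFL);  /* make sure exited children stay zombies */
    reaped_count = 0;
    for (int i = 0; i < NCHILD; i++) {
        pids[i] = fork();
        if (pids[i] < 0)
            return -1;
        if (pids[i] == 0)
            _exit(0);
    }

    /* Wait for each child to exit without reaping it (WNOWAIT). */
    for (int i = 0; i < NCHILD; i++)
        if (waitid(P_PID, (id_t)pids[i], &info, WEXITED | WNOWAIT) < 0)
            return -1;

    sigchld_handler_sketch(SIGCHLD);   /* direct call, for illustration only */
    return reaped_count;
}
```

The loop matters because pending SIGCHLD signals are not queued: several children can exit while the signal is blocked and the handler may run only once, so a single call has to reap all of them.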
    

    Here is the relevant text from the POSIX man page for waitpid:

    If waitpid() was invoked with WNOHANG set in options, it has at least one child process specified by pid for which status is not available, and status is not available for any process specified by pid, 0 is returned. Otherwise, -1 shall be returned, and errno set to indicate the error.
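The return values quoted above can be checked with a small experiment. This is a sketch under stated assumptions: `wnohang_return_values` is a hypothetical helper, the calling process is assumed to have no other children, and a pipe is used only to keep the child alive until the parent has polled once.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Records what waitpid(-1, NULL, WNOHANG) returns in three situations:
 *   [0] before any child exists,
 *   [1] while one child is still running,
 *   [2] after that child has been reaped.
 * err[i] holds errno as left by the corresponding call.
 * Returns 0 on success and -1 if the setup failed.
 */
int wnohang_return_values(pid_t ret[3], int err[3])
{
    int fds[2];
    char byte;
    pid_t pid;

    /* Case 0: no children at all. */
    errno = 0;
    ret[0] = waitpid(-1, NULL, WNOHANG);
    err[0] = errno;

    if (pipe(fds) < 0)
        return -1;
    pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        close(fds[1]);
        /* Blocks until the parent closes its write end (EOF). */
        if (read(fds[0], &byte, 1) < 0)
            _exit(1);
        _exit(0);
    }
    close(fds[0]);

    /* Case 1: a child exists but has not changed state. */
    errno = 0;
    ret[1] = waitpid(-1, NULL, WNOHANG);
    err[1] = errno;

    close(fds[1]);           /* let the child exit */
    waitpid(pid, NULL, 0);   /* blocking reap */

    /* Case 2: the only child is gone again. */
    errno = 0;
    ret[2] = waitpid(-1, NULL, WNOHANG);
    err[2] = errno;
    return 0;
}
```

Only case 1 produces 0. Cases 0 and 2 produce -1 with errno set to ECHILD, which is the situation the original `if (pid == 0)` check in sigchld_handler never caught.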