Tags: c, linux, linux-kernel, kernel

Why would a system call like clone() fail, resume, and pause?


Here is my situation: I am writing a UNIX pipe (|) for an Operating Systems course. I finished my implementation; however, I noticed that I was passing test cases non-deterministically. The test case in question is of the form:

    def test_no_orphans(self):
        self.assertTrue(self.make, msg='make failed')
        subprocess.call(('strace', '-o', 'trace.log','./pipe','ls','wc','cat','cat'))
        ps = subprocess.Popen(['grep','-o','clone(','trace.log'], stdout=subprocess.PIPE)
        out1 = subprocess.check_output(('wc','-l'), stdin=ps.stdout)
        ps.wait()        
        ps.stdout.close()
        ps = subprocess.Popen(['grep','-o','wait','trace.log'], stdout=subprocess.PIPE)
        out2 = subprocess.check_output(('wc','-l'), stdin=ps.stdout)
        ps.wait()  
        ps.stdout.close()
        out1 = int(out1.decode("utf-8")[0])
        out2 = int(out2.decode("utf-8")[0])
        if out1 == out2 or out1 < out2:
            orphan_check = True
        else:
            orphan_check = False
        self.assertTrue(orphan_check, msg="Found orphan processes")
        subprocess.call(['rm', 'trace.log'])
        self.assertTrue(self._make_clean, msg='make clean failed')

After inspecting the relevant log, I find that:

    clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f04174ad850) = 1330
    close(4)                                = 0
    --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=1330, si_uid=1000, si_status=0, si_utime=0, si_stime=0} ---
    pipe([4, 5])                            = 0
    clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f04174ad850) = 1331
    close(5)                                = 0
    close(3)                                = 0
    pipe([3, 5])                            = 0
    clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f04174ad850) = ? ERESTARTNOINTR (To be restarted)
    --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=1331, si_uid=1000, si_status=0, si_utime=0, si_stime=0} ---
    clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f04174ad850) = 1332
    close(5)                                = 0
    --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=1332, si_uid=1000, si_status=0, si_utime=0, si_stime=0} ---
    close(4)                                = 0
    pipe([4, 5])                            = 0
    clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f04174ad850) = 1333

Specifically, the line with ERESTARTNOINTR (To be restarted) is causing the test to fail. My professor helped me debug and said it is possibly a bug in the test cases, so I reported it to the TA in charge of the assignment.

I was wondering why exactly clone() could fail, since fork() seems to handle this for us in a nice C wrapper. I was also wondering what specifically is happening here, and why system calls in general seem to have an incomplete and a resumed state. My OS professor said he doesn't know too much about it and suggested it could be "race conditions in the kernel".


Solution

  • The man page for clone(2) says that ERESTARTNOINTR is returned when

    System call was interrupted by a signal and will be restarted. (This can be seen only during a trace.)

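    The failing clone syscall in the provided log was interrupted by the SIGCHLD signal for pid 1331. The kernel restarts the call automatically once the signal has been dealt with, which is the next clone line in the log (the one returning 1332), so your program never sees the error. strace records both the interrupted attempt and the restart, so in the log shown grep -o 'clone(' counts five lines even though only four children were created, which would explain why the clone count can exceed the wait count in the test.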
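
One way a test could avoid counting the interrupted attempt is to count only the clone lines whose result is a child PID. The following is a minimal sketch, not the course's actual test: the name count_completed_clones is hypothetical, and it assumes the trace was written by strace -o without -f, so each line starts with the syscall name as in the log above.

```python
import re

# Matches an strace line for clone(...) and captures the token after "= ".
CLONE_LINE = re.compile(r"^clone\(.*\)\s*=\s*(\S+)")


def count_completed_clones(trace_text):
    """Count clone() lines whose result is a child PID (a plain integer)."""
    count = 0
    for line in trace_text.splitlines():
        match = CLONE_LINE.match(line)
        # An interrupted attempt is logged as "= ? ERESTARTNOINTR ...", so the
        # captured token is "?" rather than a number and is skipped. Failed
        # calls ("= -1 E...") are skipped too, because "-1" is not all digits.
        if match and match.group(1).isdigit():
            count += 1
    return count
```

In the test, the result of count_completed_clones(open('trace.log').read()) could stand in for the grep-based clone count, so that a restarted call is counted once.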