Tags: c, linux, multithreading, operating-system, strace

Why is strace showing ERESTARTSYS for read?


I have a multi-threaded program, which, when run through strace, shows this:

read(10, "lorem ipsum...", 100) = 100
read(10, 0x2ae9ebcb5000, 8191) = ? ERESTARTSYS (To be restarted)
--- SIGTERM ... ---

Whenever the ERESTARTSYS occurs, the program ends up hanging on the read. When the ERESTARTSYS does not occur, the program exits successfully and I get:

read(10, "lorem ipsum...", 100) = 100
read(10, "", 8191) = 0
...
exit_group(0)

Looking at the strace man page (for a version of strace other than mine) and SO questions like this and this, it seems that the read is being interrupted by a signal. I could be misunderstanding the documentation, but I don't see any signal other than SIGTERM, which I assume comes from my terminating the program.

I've determined that both reads come from a single std::getline invocation, which reads a second time when the delimiter isn't found. (It isn't found because the delimiter is incorrect and appears nowhere in the string, but I can't fix that because it lives in a library I have no control over.) Adding the delimiter to the string seems to prevent the second read, and the code then runs without a problem.
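The delimiter behaviour described above can be reproduced in isolation. This is only a sketch: `read_until` is a hypothetical helper, and an in-memory `std::istringstream` stands in for whatever stream wraps fd 10 in the real program. When the delimiter is absent, `std::getline` keeps extracting until it reaches end-of-file; on a stream backed by a descriptor, hitting end-of-file presumably requires one more read() call, which would match the second read in the trace.

```cpp
#include <istream>
#include <sstream>
#include <string>

// Hypothetical helper: extract characters up to (and discarding) `delim`.
// If `delim` never appears, std::getline consumes the rest of the stream
// and sets eofbit on it.
std::string read_until(std::istream& in, char delim) {
    std::string out;
    std::getline(in, out, delim);
    return out;
}
```

With the input "lorem ipsum\n", calling `read_until(in, '\n')` stops at the newline and leaves `in.eof()` false, whereas `read_until(in, ';')` returns the whole input and leaves `in.eof()` true, because the stream had to be drained to discover that no `;` exists.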

I'm also positive that there's a race condition somewhere in the code, because this error does not occur when I turn off the parallelism. One of my guesses is that the read is being interrupted during a thread context switch; however, that's just a wild guess, and nothing in the strace output indicates that it is true. Additionally, I'm not sure why the read wouldn't simply restart after the thread is switched back in. I can't find the race condition, though, and I was hoping that understanding the strace output and the ERESTARTSYS could help me figure out where the bug is.
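For context on the restart question: ERESTARTSYS is a kernel-internal code that strace can display but that user space never receives as errno. When a signal handler runs, the interrupted read() is either restarted transparently (handler installed with SA_RESTART) or fails with EINTR (no SA_RESTART). The sketch below shows the EINTR case under stated assumptions: `interrupted_read_errno` and `on_alarm` are hypothetical names, SIGALRM stands in for whatever signal arrives, and an empty pipe stands in for fd 10. It does not model the trace above, where SIGTERM presumably had its default action and ended the process.

```cpp
#include <cerrno>
#include <signal.h>
#include <sys/time.h>
#include <unistd.h>

// Empty handler: its only job is to interrupt the blocked read().
static void on_alarm(int) {}

// Blocks in read() on an empty pipe, lets SIGALRM interrupt it, and returns
// the errno the caller observes (0 if read() unexpectedly succeeded).
int interrupted_read_errno() {
    int fds[2];
    if (pipe(fds) != 0) return errno;

    struct sigaction sa {};
    struct sigaction old {};
    sa.sa_handler = on_alarm;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;  // deliberately no SA_RESTART, so read() is not restarted
    sigaction(SIGALRM, &sa, &old);

    // Repeating 50 ms timer, so a tick that lands before read() starts is harmless.
    itimerval timer{};
    timer.it_value.tv_usec = 50000;
    timer.it_interval.tv_usec = 50000;
    setitimer(ITIMER_REAL, &timer, nullptr);

    char buf[8];
    ssize_t n = read(fds[0], buf, sizeof buf);  // write end open, no data: blocks
    int err = (n < 0) ? errno : 0;

    itimerval off{};
    setitimer(ITIMER_REAL, &off, nullptr);  // disarm before restoring the old handler
    sigaction(SIGALRM, &old, nullptr);
    close(fds[0]);
    close(fds[1]);
    return err;
}
```

Setting `sa.sa_flags = SA_RESTART` instead would make the kernel re-enter the read() after the handler returns, so in this sketch the call would keep blocking rather than report an error.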

If it helps, I'm running on RHEL5 and compiling using gcc 4.7.2.


Solution

  • According to this link, ERESTARTSYS shows up when a read is interrupted while running under strace on RHEL systems. In the code I was looking at, it turned out the read was simply blocked waiting for input: no EOF had arrived because a write end was still open (due to a race condition).
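The rule behind that hang can be shown with a small sketch, assuming fd 10 is the read end of a pipe (the answer's mention of a write end suggests this, but the question does not say so). A read on an empty pipe returns 0 (EOF) only once every descriptor referring to the write end has been closed; until then it blocks. The names `PipeEofDemo`, `read_would_return`, and `demo_pipe_eof` are illustrative, and `poll` is used so the demonstration never actually blocks.

```cpp
#include <poll.h>
#include <unistd.h>

// Outcome of the demonstration; all names here are illustrative.
struct PipeEofDemo {
    bool ready_while_writer_open;  // would read() return with a stray write end open?
    bool ready_after_close;        // and once that write end has been closed?
    ssize_t final_read;            // result of the last read(); 0 means EOF
};

// True if read() on fd would return right away (data available or EOF).
static bool read_would_return(int fd, int timeout_ms) {
    pollfd p{};
    p.fd = fd;
    p.events = POLLIN;
    int rc = poll(&p, 1, timeout_ms);
    return rc > 0 && (p.revents & (POLLIN | POLLHUP)) != 0;
}

PipeEofDemo demo_pipe_eof() {
    PipeEofDemo result{false, false, -1};
    int fds[2];
    if (pipe(fds) != 0) return result;

    // Stands in for a write end that another thread has not closed yet.
    int stray_writer = dup(fds[1]);

    char buf[16];
    if (stray_writer >= 0 &&
        write(fds[1], "lorem", 5) == 5 &&   // the intended writer sends its data
        close(fds[1]) == 0 &&               // and closes its own write end
        read(fds[0], buf, 5) == 5) {        // first read consumes the data
        // Pipe is empty but a write end is still open: a second read() would block.
        result.ready_while_writer_open = read_would_return(fds[0], 100);

        close(stray_writer);
        stray_writer = -1;

        // No writers remain: the kernel reports hang-up and read() returns 0.
        result.ready_after_close = read_would_return(fds[0], 100);
        result.final_read = read(fds[0], buf, sizeof buf);
    }

    if (stray_writer >= 0) close(stray_writer);
    close(fds[0]);
    return result;
}
```

In a multi-threaded program, one common way to end up with such a stray write end is a descriptor that a thread or child process inherits or duplicates and closes late, which would fit the race described above. Whether that is what happened in the original code is not stated.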