Tags: c, linux, pipe, thread-safety

fgets getting stuck indefinitely in a multi-threaded environment


I have a system which consists of multiple threads. Each thread uses a common function called runCmd to execute a shell command. Everything works fine, but once in a while one of the threads (always the Alerts thread) gets stuck inside the runCmd function indefinitely. After adding some logging, it turned out that it gets stuck in the fgets call.

It is also worth noting that it does not matter what the shell command is; it always gets stuck in the same place.

About the system:

  • This is all running on an embedded Linux system with a TI chip
  • All of these threads run inside one service - only one process (verified with ps aux)

Weird observations:

  • When observing the logs using journalctl, I see different PIDs reported for that service. Mostly it is the same PID (cross-checked with ps aux), but some log entries show different PIDs, and those PIDs do not appear in the ps aux list.

How can I debug this further - with trace logs or something else? Are there any workarounds?

Here is the runCmd function that gets stuck in that thread:

#include <stdio.h>
#include <string.h>
#include <stdbool.h>

bool runCmd(char const* command, char* response, size_t bufferSize) {
    FILE* pFile = popen(command, "r");
    if (pFile == NULL) {
        printf("Failed to run command\n" );
        return false;
    }
    if (NULL != response) {
        while (fgets(response, bufferSize, pFile) != NULL) {
            response = response + strlen(response);
        }
    }
    pclose(pFile);
    return true;
}
Things I have already tried:

  • Adding a mutex so only one thread runs a shell command at a time - it just resulted in all the threads getting stuck, as the Alerts thread was holding the mutex while stuck inside runCmd
  • Tried fread instead of fgets, but with the same result
  • Tried using select to wait (with a timeout) until the pipe was readable before reading - but it made no difference (see the sketch after this list)
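
For reference, that select-based attempt would look roughly like the sketch below. This is illustrative, not my original code: readWithTimeout and the 5-second timeout are assumptions, and select on the underlying fd can be misleading for a FILE* stream, because stdio may already hold buffered data.

#include <stdio.h>
#include <sys/select.h>

// Illustrative sketch: wait until the pipe behind a popen() stream is
// readable (or a timeout expires) before calling fgets.
static int readWithTimeout(FILE* pFile, char* response, size_t bufferSize) {
    int fd = fileno(pFile);                                   // fd behind the stream
    fd_set readSet;
    struct timeval timeout = { .tv_sec = 5, .tv_usec = 0 };   // arbitrary 5 s

    FD_ZERO(&readSet);
    FD_SET(fd, &readSet);

    // select returns 0 on timeout, -1 on error, > 0 if fd is readable.
    // Caveat: because of stdio buffering, "fd readable" and "fgets won't
    // block" are not quite the same thing.
    if (select(fd + 1, &readSet, NULL, NULL, &timeout) <= 0)
        return -1;                                            // timed out or error

    return fgets(response, bufferSize, pFile) != NULL ? 0 : -1;
}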

Solution

  • If the rest of the code that we can't see is thread-safe (and no other thread is changing the buffers that command and response point at), I see one bug and two possible causes for the hanging fgets.

    • Bug: The program may fill the response buffer and then start writing out of bounds, since there is no boundary check. The loop tells fgets it may use up to bufferSize bytes on every iteration, even though each iteration eats away at the available response memory. That is undefined behavior, and hanging is one possible outcome.

      One possible fix:

      while (bufferSize > 1 && fgets(response, bufferSize, pFile) != NULL) {
          size_t len = strlen(response);
          response += len;
          bufferSize -= len;   // shrink what you tell fgets it can use
      }
      

      You could simplify this by using fread instead. The logic is the same, but fread doesn't stop at newlines and returns how many items it read, so you don't need to call strlen afterwards.

      if (bufferSize--) {          // leave room for null terminator
          size_t len;
          while (bufferSize && (len = fread(response, 1, bufferSize, pFile))) {
              response += len;
              bufferSize -= len;   // shrink what you tell fread it can use
          }
          *response = '\0';
      }
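
      If you want the whole function in one piece, a corrected runCmd could look like the sketch below. It keeps the original signature and the fread loop from above; the pclose return-value check is an addition of mine, since pclose also reports whether the command itself failed:

      bool runCmd(char const* command, char* response, size_t bufferSize) {
          FILE* pFile = popen(command, "r");
          if (pFile == NULL) {
              printf("Failed to run command\n");
              return false;
          }
          if (response != NULL && bufferSize > 0) {
              size_t remaining = bufferSize - 1;   // leave room for the null terminator
              size_t len;
              while (remaining && (len = fread(response, 1, remaining, pFile))) {
                  response += len;
                  remaining -= len;
              }
              *response = '\0';
          }
          // pclose returns -1 on failure, otherwise the command's exit status
          return pclose(pFile) != -1;
      }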
      
    • The command simply doesn't finish. If command is waiting for something that never happens, it will hang.
      Example: If command executed this script, which waits for input from the user, it would just hang until the user pressed return. You wouldn't see any output either, since fgets is waiting for a newline that doesn't arrive (until after the user has pressed return).

      #!/bin/bash
      
      echo -n "Enter: "
      read -r var
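
      If that is the cause here, one possible workaround (assuming your commands never legitimately need stdin) is to redirect the child's input from /dev/null, so a command that tries to read input sees end-of-file immediately instead of blocking. The popenNoStdin helper below is a sketch of this idea, not part of the original code:

      #include <stdio.h>

      // Sketch of a hypothetical helper: run the command with stdin
      // redirected from /dev/null, so a read from stdin returns EOF
      // immediately instead of hanging forever.
      static FILE* popenNoStdin(char const* command) {
          char guarded[512];                  // arbitrary size for the sketch
          snprintf(guarded, sizeof guarded, "%s < /dev/null", command);
          return popen(guarded, "r");
      }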