Python-Subprocess-Popen inconsistent behavior in a multi-threaded environment


I have the following piece of code running inside a thread. 'executable' produces a unique output string for each input 'url':

p = Popen(["executable", url], stdout=PIPE, stderr=PIPE, close_fds=True)
output,error = p.communicate()
print output

When the above code is executed for multiple input 'urls', the output of subprocess p is not consistent. For some of the urls, the subprocess terminates without producing any 'output'. I tried printing p.returncode for each failed 'p' instance (the failed urls are not consistent across runs either) and got '-11' as the return code, with 'error' an empty string. Can someone please suggest a way to get consistent behavior/output for each run in a multithreaded environment?


Solution

  • A return code of -11 might mean that the C program is at fault, e.g., you are starting too many subprocesses and that causes SIGSEGV in the C executable (on POSIX, a negative returncode -N means the child was terminated by signal N, and signal 11 is SIGSEGV on Linux). You can limit the number of concurrent subprocesses using multiprocessing.pool.ThreadPool, concurrent.futures.ThreadPoolExecutor, or threading + Queue-based solutions:

    #!/usr/bin/env python
    from multiprocessing.dummy import Pool # uses threads
    from subprocess import Popen, PIPE
    
    def get_url(url):
        p = Popen(["executable", url], stdout=PIPE, stderr=PIPE, close_fds=True)
        output, error = p.communicate()
        return url, output, error, p.returncode
    
    pool = Pool(20) # limit number of concurrent subprocesses
    for url, output, error, returncode in pool.imap_unordered(get_url, urls):
        print("%s %r %r %d" % (url, output, error, returncode))
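
    The same limit can also be expressed with concurrent.futures.ThreadPoolExecutor, which is mentioned above but not shown. The following is only a sketch under assumptions: the names run_one, run_all, report, and the command_prefix parameter are hypothetical helpers introduced for illustration, and "executable" still stands for the program from the question.

```python
import sys
from concurrent.futures import ThreadPoolExecutor, as_completed
from subprocess import Popen, PIPE


def run_one(command_prefix, url):
    """Run one child process for url and collect everything it produced."""
    p = Popen(list(command_prefix) + [url],
              stdout=PIPE, stderr=PIPE, close_fds=True)
    output, error = p.communicate()
    return url, output, error, p.returncode


def run_all(urls, command_prefix=("executable",), max_workers=20):
    """Run the command once per url, with at most max_workers children alive."""
    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = [executor.submit(run_one, command_prefix, url)
                   for url in urls]
        # Results arrive in completion order, like imap_unordered above.
        for future in as_completed(futures):
            results.append(future.result())
    return results


def report(results, stream=sys.stderr):
    """Write one line per child that did not exit with status 0."""
    for url, output, error, returncode in results:
        if returncode != 0:
            stream.write("%s failed: returncode=%d stderr=%r\n"
                         % (url, returncode, error))
```

    Calling report(run_all(urls)) would list only the failed urls. Lowering max_workers is the knob to try first if failures disappear when fewer children run at once.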
    

    Make sure the executable can be run in parallel e.g., it doesn't use some shared resource. To test, you could run in a shell:

    $ executable url1 & executable url2
    

    Could you please explain more about "you are starting too many subprocesses and it causes SIGSEGV in the C executable", and possibly a solution to avoid that?

    Possible problem:

    • "too many processes"
    • -> "not enough memory in the system or some other resource"
    • -> "trigger the bug in the C code that otherwise is hidden or rare"
    • -> "illegal memory access"
    • -> SIGSEGV

    The solution suggested above is:

    • "limit number of concurrent processes"
    • -> "enough memory or other resources in the system"
    • -> "bug is hidden or rare"
    • -> no SIGSEGV

    See "What is SIGSEGV run time error in C++?" for background. In short, your program is killed with that signal if it tries to access memory that it is not supposed to. Here's an example of such a program:

    /* try to fail with SIGSEGV sometimes */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>
    
    int main(void) {
      char *null_pointer = NULL;
    
      srand((unsigned)time(NULL));
    
      if (rand() < RAND_MAX/2) /* simulate some concurrent condition 
                                  e.g., memory pressure */
        fprintf(stderr, "%c\n", *null_pointer); /* dereference null pointer */
    
      return 0;
    }
    

    If you run it with the above Python script then it would return -11 occasionally.
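
    To make the -11 easier to read in logs, the return code can be translated into a signal name. This is a minimal sketch, assuming Python 3.5+ (for signal.Signals); describe_returncode is a hypothetical helper name, not part of the subprocess module.

```python
import signal


def describe_returncode(returncode):
    """Turn a Popen.returncode value into a human-readable description."""
    if returncode is None:
        return "still running"
    if returncode < 0:
        # On POSIX, a negative value -N means "terminated by signal N".
        try:
            name = signal.Signals(-returncode).name
        except ValueError:
            name = "unknown signal"
        return "killed by signal %d (%s)" % (-returncode, name)
    return "exited with status %d" % returncode
```

    For example, describe_returncode(p.returncode) could be printed next to each url in the loop above instead of the raw number.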

    Also, p.returncode is not sufficient for debugging purposes. Is there any other option to get more debug info and find the root cause?

    I wouldn't exclude the Python side completely, but it is most likely that the problem is in the C program. You could use gdb to get a backtrace and see where in the call stack the error comes from.
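
    One way to collect such a backtrace from the Python side is to rerun a failing url under gdb in batch mode. This is only a sketch: it assumes gdb is installed and on PATH, that the executable was built with debug symbols (otherwise the backtrace is of limited use), and that the crash reproduces when the program runs alone. The helper names gdb_backtrace_command and backtrace_for are hypothetical.

```python
import subprocess


def gdb_backtrace_command(argv):
    """Build a gdb command line that runs argv and prints a backtrace."""
    # -batch: exit when done; -ex: run a gdb command; --args: program + args.
    return ["gdb", "-batch", "-ex", "run", "-ex", "bt", "--args"] + list(argv)


def backtrace_for(argv, timeout=60):
    """Run argv under gdb and return gdb's combined stdout/stderr as text."""
    completed = subprocess.run(
        gdb_backtrace_command(argv),
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
        timeout=timeout,
    )
    return completed.stdout
```

    For a url that returned -11, print(backtrace_for(["executable", url])) would show where the program stopped. If the crash only occurs under load, a single rerun may exit normally, in which case gdb reports that there is no stack.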