python, shell, subprocess, python-multithreading

Python: execute cat subprocess in parallel


I am running several cat | zgrep commands on a remote server and gathering their output individually for further processing:

import multiprocessing as mp
import subprocess

class MainProcessor(mp.Process):
    def __init__(self, peaks_array):
        super(MainProcessor, self).__init__()
        self.peaks_array = peaks_array

    def run(self):
        for peak_arr in self.peaks_array:
            peak_processor = PeakProcessor(peak_arr)
            peak_processor.start()

class PeakProcessor(mp.Process):
    def __init__(self, peak_arr):
        super(PeakProcessor, self).__init__()
        self.peak_arr = peak_arr

    def run(self):
        command = 'ssh remote_host cat files_to_process | zgrep --mmap "regex" '
        # check_output returns bytes on Python 3, so decode before splitting
        log_lines = subprocess.check_output(command, shell=True).decode().split('\n')
        process_data(log_lines)

This, however, results in sequential execution of the subprocess('ssh ... cat ...') commands: the second peak waits for the first to finish, and so on.

How can I modify this code so that the subprocess calls run in parallel, while still being able to collect the output for each individually?


Solution

  • Another approach (rather than putting the shell processes in the background, as the other suggestion does) is to use multithreading.

    The run method that you have would then do something like this (note that start_new_thread requires an argument tuple):

    import thread  # the module is named _thread on Python 3

    thread.start_new_thread(myFuncThatDoesZGrep, ())
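    If you prefer the higher-level threading module, a minimal equivalent (with myFuncThatDoesZGrep standing in for your own function) would be:

    import threading

    threading.Thread(target=myFuncThatDoesZGrep).start()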

    To collect results, you can do something like this:

    import threading

    class MyThread(threading.Thread):
        def run(self):
            self.finished = False
            # Your code to run the command here.
            blahBlah()
            # When finished, store the results *before* setting the flag,
            # so a poller never sees finished == True without the results.
            self.results = []
            self.finished = True

    Run the thread as shown above. When your thread object has myThread.finished == True, you can collect the results via myThread.results.
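
    Putting the two together for this question, a minimal sketch (assuming the peaks_array and process_data names from the question; remote_host, the file list, and the regex are placeholders) could look like this:

    import subprocess
    import threading

    class PeakThread(threading.Thread):
        def __init__(self, peak_arr):
            super(PeakThread, self).__init__()
            self.peak_arr = peak_arr
            self.results = None

        def run(self):
            # Each thread blocks on its own subprocess, so the several
            # ssh/zgrep pipelines run concurrently with one another.
            command = 'ssh remote_host cat files_to_process | zgrep --mmap "regex" '
            output = subprocess.check_output(command, shell=True)
            self.results = output.decode().split('\n')

    threads = [PeakThread(peak_arr) for peak_arr in peaks_array]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # wait for each thread to finish instead of polling
    for t in threads:
        process_data(t.results)

    Using join() here replaces the finished flag: it blocks until each thread's run method has returned, after which that thread's results attribute is guaranteed to be set.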