Tags: python, python-multiprocessing, amazon-emr

Is there a way to make the current script wait for another Python script, launched with subprocess.Popen(), until it completes?


I am trying to run two Python scripts (say dp_01.py and dp_02.py) from a master Python script. I want to execute them one after the other. This is my current code in the master script:

import subprocess

job1_exec = "python dp_01.py"
try:
    #os.system(job1_exec)
    command1 = subprocess.Popen(job1_exec, shell=True)
    command1.wait()
except:
    print("processing of dp_01 code failed")
print("Processing of dp_01 code completed")


job2_exec = "python dp_02.py"
try:
    #os.system(job2_exec)
    command2 = subprocess.Popen(job2_exec, shell=True)
    command2.wait()
except:
    print("processing of dp_02 code failed")
print("Processing of dp_02 code completed")

The issue here is that the master script is not waiting for dp_01.py to complete its execution. It instantly starts executing dp_02.py. How do I wait for dp_01.py to finish before the execution of dp_02.py starts?


Solution

  • A solution could be to replace Popen with check_output or run. check_output is a wrapper around run that captures and returns the subprocess's stdout, and it also blocks the main thread while the command runs.

    According to this SO question,

    The main difference [between 'run' and 'Popen'] is that subprocess.run executes a command and waits for it to finish, while with subprocess.Popen you can continue doing your stuff while the process finishes and then just repeatedly call subprocess.communicate yourself to pass and receive data to your process.
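    As a minimal, self-contained sketch of that difference, subprocess.run does not return until each child has exited, so a plain loop already runs commands strictly one after the other (short `python -c` one-liners stand in for the OP's dp_01.py and dp_02.py here):

    ```python
    import subprocess
    import sys

    # Each run() call blocks until the child process finishes,
    # so "first" is guaranteed to be printed before "second".
    for code in ('print("first")', 'print("second")'):
        result = subprocess.run([sys.executable, "-c", code],
                                capture_output=True, text=True)
        print(result.stdout, end="")
    ```

    With Popen, by contrast, both children would be started immediately and could interleave their output.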

    Let's consider two different jobs, where job number 1 takes considerably longer to perform than job number 2, simulated here by a sleep(7):

    # dp_01.py
    import time
    time.sleep(7)
    print("--> Hello from dp_01", end="")
    

    and,

    # dp_02.py
    print("--> Hello from dp_02", end="")
    

    Then, for simplicity of testing, I move the job execution in the main script into functions:

    import time
    import subprocess
    
    jobs = ["dp_01.py", "dp_02.py"]
    
    # The current approach of the OP, using 'Popen':
    def do(job):
      subprocess.Popen("python "+job, shell=True)
    
    
    # Alternative, main-thread-blocking approach,
    def block_and_do(job):
      out = subprocess.check_output("python "+job, shell=True)
      print(out.decode('ascii')) # 'out' is a byte string
    
    
    # Helper function for testing the synchronization
    # of the two different job-workers above
    def test_sync_of(worker_func, jobs):
      test_started = time.time()
      for job in jobs:
        print("starting job '%s', at time %d" % (job, time.time() - test_started))
        worker_func(job)
        print("completed job '%s', at time %d" % (job, time.time() - test_started))
        time.sleep(1)
    

    This results in:

    test_sync_of(do, jobs)
    
    starting job 'dp_01.py', at time 0
    completed job 'dp_01.py', at time 0
    starting job 'dp_02.py', at time 1
    completed job 'dp_02.py', at time 1
     --> Hello from dp_02
     --> Hello from dp_01
    

    while,

    test_sync_of(block_and_do, jobs)
    
    starting job 'dp_01.py', at time 0
     --> Hello from dp_01
    completed job 'dp_01.py', at time 7
    starting job 'dp_02.py', at time 8
     --> Hello from dp_02
    completed job 'dp_02.py', at time 8
    

    Finally, I hope this solves your problem. However, this might not be the best solution to your larger problem: you might want to take a closer look at the multiprocessing module. Perhaps the jobs in the separate scripts could be imported as modules and run in separate processes from the master script.
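    For instance, a minimal sketch of that idea, assuming the work of each script can be wrapped in a plain function (job_01 and job_02 below are hypothetical stand-ins for the real scripts' work):

    ```python
    from multiprocessing import Process

    def job_01():
        print("work of dp_01")

    def job_02():
        print("work of dp_02")

    if __name__ == "__main__":
        for job in (job_01, job_02):
            p = Process(target=job)
            p.start()
            p.join()  # block until this job's process has finished
    ```

    join() gives the same "wait until done" guarantee as check_output, while keeping everything inside one Python program.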

    A final note: be very careful when using shell=True; see this other SO question for details.
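    As a brief illustration of the safer pattern, passing the command as an argument list avoids the shell entirely, so a filename containing spaces or shell metacharacters cannot be misinterpreted (the `-c pass` command below is a runnable stand-in for an actual script path):

    ```python
    import subprocess
    import sys

    # An argument list is executed directly, with no shell parsing.
    cmd = [sys.executable, "-c", "pass"]
    subprocess.run(cmd, check=True)  # check=True raises CalledProcessError on a non-zero exit
    ```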