I have three functions that execute three different Jupyter notebooks using papermill. I want the first function (job1) and the second (job2) to run concurrently, and the last function (job3) to run only once job1 has finished without any errors. I'm not sure whether it makes sense to create a new thread for the second function, or how to use the join() method appropriately. I'm running on Windows, and for some reason concurrent.futures and multiprocessing don't work, which is why I'm using the threading module.
import threading
import papermill as pm

def job1():
    return pm.execute_notebook('notebook1.ipynb', log_output=False)

def job2():
    return pm.execute_notebook('notebook2.ipynb', log_output=False)

def job3():
    return pm.execute_notebook('notebook3.ipynb', log_output=False)
t1 = threading.Thread(target=job1)
t2 = threading.Thread(target=job2)
t3 = threading.Thread(target=job3)
try:
    t1.start()
    t1.join()
    t2.start()
except:
    pass
finally:
    t3.start()
I like to start off by visualizing the desired flow, which I understand to look like:

t1 --\
      +--> t3
t2 --/
This means that t1 and t2 need to start concurrently and then you need to join on both:
t1.start() # <- Started
t2.start() # <- Started
# t1 and t2 executing concurrently
t1.join()
t2.join()
# wait for both to finish
t3.start()
t3.join()
The t1/t2 join order isn't really important, since your program has to wait for the longest-running thread either way. If t1 finishes first, t1.join() returns immediately and the program then blocks on t2.join(); if t2 finishes first, the program still waits on t1.join(), and the subsequent t2.join() returns immediately as a "no-op".
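One more thing: your question says t3 should run only if job1 finished without errors, but a try/except around t1.start() can't catch that — an exception raised inside a thread stays inside that thread and never propagates to the caller. A common workaround is a small Thread subclass that records its target's exception so the main thread can inspect it after join(). Here's a sketch of that idea; the job functions below are hypothetical stand-ins for your pm.execute_notebook calls:

```python
import threading

# Stand-ins for the real notebook jobs; swap the bodies for your
# pm.execute_notebook(...) calls.
def job1():
    return "notebook1 done"

def job2():
    return "notebook2 done"

def job3():
    return "notebook3 done"

class JobThread(threading.Thread):
    """Thread that captures any exception raised by its target,
    so the caller can check for failure after join()."""
    def __init__(self, target):
        super().__init__()
        self._target_fn = target
        self.error = None          # filled in if the target raises

    def run(self):
        try:
            self._target_fn()
        except Exception as exc:   # papermill raises on notebook errors
            self.error = exc

if __name__ == "__main__":
    t1 = JobThread(job1)
    t2 = JobThread(job2)
    t1.start()                     # t1 and t2 run concurrently
    t2.start()
    t1.join()
    t2.join()

    # Gate t3 on job1 having completed without raising.
    if t1.error is None:
        t3 = JobThread(job3)
        t3.start()
        t3.join()
    else:
        print(f"job1 failed, skipping job3: {t1.error!r}")
```

This keeps the join-on-both structure from above and simply adds the success check before starting t3. If you only care about job1's outcome (and not job2's), this is all you need; if job2's failure should also matter, check t2.error the same way.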