
How does Thread().join work in the following case?


I saw the following code in a thread tutorial:

from time import sleep, perf_counter
from threading import Thread

start = perf_counter()

def foo():
    sleep(5)

threads = []
for i in range(100):
    t = Thread(target=foo,)
    t.start()
    threads.append(t)

for i in threads:
    i.join()

end = perf_counter()

print(f'Took {end - start}')

When I run it, it prints Took 5.014557975. That part makes sense: it does not take 500 seconds, as the non-threaded version would.

What I don't understand is how .join works. Without calling .join I got Took 0.007060926999999995, which indicates that the main thread reached the end of the script before the child threads finished. Since .join() is supposed to block, won't the first iteration of the loop block for 5 seconds before the second iteration starts? How does the whole loop still finish in about 5 seconds?

I keep reading that Python threading is not truly multithreaded and only appears to be (it runs on a single core), but if that is the case, how exactly does the timer keep running in the background if nothing runs in parallel?


Solution

  • The OS is in control once a thread starts, and it context-switches between threads.

    The time functions read a clock on your computer via the OS, and that clock is always running. A thread blocked in sleep() uses no CPU: it releases CPython's global interpreter lock (GIL), and the OS wakes it once its 5 seconds have elapsed.

    The threads are not executing Python code in parallel, but they do not need to. All of them are waiting at the same time, so the 5-second sleeps overlap instead of adding up.

    Here is a little more detail on what is happening. I subclassed Thread and overrode its run and join methods to log when they are called.

    Caveat: the documentation specifically states

    "only override the __init__() and run() methods of this class"

    so I was surprised that overriding join didn't cause problems.

    from time import sleep, perf_counter
    from threading import Thread
    import pandas as pd
     
    c = {}
    def foo(i):
        c[i]['foo start'] = perf_counter() - start
        sleep(5)
        # print(f'{i} - start:{start} end:{perf_counter()}')
        c[i]['foo end'] = perf_counter() - start
    
    class Test(Thread):
        def __init__(self,*args,**kwargs):
            self.i = kwargs['args'][0]
            super().__init__(*args,**kwargs)
        def run(self):
            # print(f'{self.i} - started:{perf_counter()}')
            c[self.i]['thread start'] = perf_counter() - start
            super().run()
        def join(self, timeout=None):
            # print(f'{self.i} - joined:{perf_counter()}')
            # Record when join() is called, then defer to Thread.join().
            c[self.i]['thread joined'] = perf_counter() - start
            super().join(timeout)
    
    threads = []
    start = perf_counter()
    for i in range(10):
        c[i] = {}
        t = Test(target=foo,args=(i,))
        t.start()
        threads.append(t)
    
    for i in threads:
        i.join()
    
    df = pd.DataFrame(c)
    print(df)
    

                          0         1         2         3         4         5         6         7         8         9
    thread start   0.000729  0.000928  0.001085  0.001245  0.001400  0.001568  0.001730  0.001885  0.002056  0.002215
    foo start      0.000732  0.000931  0.001088  0.001248  0.001402  0.001570  0.001732  0.001891  0.002058  0.002217
    thread joined  0.002228  5.008274  5.008300  5.008305  5.008323  5.008327  5.008330  5.008333  5.008336  5.008339
    foo end        5.008124  5.007982  5.007615  5.007829  5.007672  5.007899  5.007724  5.007758  5.008051  5.007549
    

    Hopefully you can see that all the threads are started in sequence, very close together. Once thread 0 is joined, the main thread does nothing else until thread 0 stops (its foo ends at about 5.008 s). By then the other threads have also finished, so each remaining join returns almost immediately.

    A thread can terminate before it is even joined: for threads 1 through 9, foo ends before the thread is joined.
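
The same behaviour can be observed without subclassing Thread, by timing each join call from the main thread. This is only a minimal sketch: the shortened 0.5-second sleep (DELAY) and the helper name time_joins are assumptions made for illustration and are not part of the code above.

```python
from threading import Thread
from time import perf_counter, sleep

DELAY = 0.5  # assumption: shortened from the 5 s used above so the sketch runs quickly


def foo():
    sleep(DELAY)


def time_joins(n_threads=5):
    """Start n_threads sleeping threads; return (per-join wait times, total elapsed)."""
    start = perf_counter()
    threads = [Thread(target=foo) for _ in range(n_threads)]
    for t in threads:
        t.start()

    waits = []
    for t in threads:
        before = perf_counter()
        t.join()  # blocks only while t is still running
        waits.append(perf_counter() - before)
    return waits, perf_counter() - start


if __name__ == "__main__":
    waits, total = time_joins()
    for i, w in enumerate(waits):
        print(f"join {i} blocked for {w:.3f}s")
    print(f"total: {total:.3f}s")
```

If the sleeps overlap as described, the first join should block for roughly DELAY and the remaining joins should return almost at once, so the total stays close to DELAY rather than n_threads * DELAY.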