Search code examples
pythonpandasdataframepython-multithreading

Python Threads with Pandas Dataframe does not improve performance


i have a Dataframe of 200k lines, i want to split into parts and call my function S_Function for each partition.

def S_Function(df):
    #mycode here 
    return new_df

Main program

N_Threads = 10
Threads = []
Out = []

size = df.shape[0] // N_Threads

for i in range(N_Threads + 1):

    begin = i * size
    end = min(df.shape[0], (i+1)*size)
    Threads.append(Thread(target = S_Function, args = (df[begin:end])) )

I run the threads & make the join :

for i in range(N_Threads + 1):
    Threads[i].start()

for i in range(N_Threads + 1):
    Out.append(Threads[i].join())

output = pd.concat(Out)

The code is working perfectly but the problem is that using threading.Thread did not decrease the execution time.
Sequential Code : 16 minutes
Parallel Code : 15 minutes

Can someone explain what to improve, why this is not working well?


Solution

  • Don't use threading when you have to process CPU-bound operations. To achieve your goal, I think you should use multiprocessing module

    Try:

    import pandas as pd
    import numpy as np
    import multiprocessing
    import time
    import functools
    
    # Modify here
    CHUNKSIZE = 20000
    
    def S_Function(df, dictionnary):
        # do stuff here
        new_df = df
        return new_df
    
    
    if __name__ == '__main__':
        # Load your dataframe
        df = pd.DataFrame({'A': np.random.randint(1, 30000000, 200000).tolist()})
    
        # Create chunks to process
        chunks = (df[i:i+CHUNKSIZE] for i in range(0, len(df), CHUNKSIZE))
        dictionnary = {'k1': 'v1', 'k2': 'v2'}
        s_func = functools.partial(S_Function, dictionnary=dictionnary)
    
        start = time.time()
        with multiprocessing.Pool(multiprocessing.cpu_count()) as pool:
            data = pool.map(s_func, chunks)
            out = pd.concat(data)
        end = time.time()
    
        print(f"Elapsed time: {end - start:.2f} seconds")