Tags: python, pandas, export-to-csv

Can I run just DataFrame.to_csv in a different thread?


I have a pretty big pandas DataFrame. I want to select some rows based on conditions and save the remaining rows as CSV.

The problem is that saving the CSV is separate from the main flow of the program and takes quite a bit of time.

Is it possible to split the work across threads so that the main thread continues with the selected rows while the unselected rows are saved as CSV in another thread?

such as...

# This is pseudo code

import pandas as pd

df = pd.DataFrame({"col1":[x for x in range(10000)], "col2":[x**2 for x in range(0, 10000)]})

df_selected = df[df["col1"] % 3 == 0]
df_unselected = df[df["col1"] % 3 != 0]


def other_thread_save_to_csv(df: pd.DataFrame):
    # This function is the last one to use df_unselected.
    ...  # save df as CSV, ideally in another thread


other_thread_save_to_csv(df_unselected)

all_other_handlings(df_selected)


Solution

  • Yes. Python's threading and multiprocessing modules both let you save a DataFrame to CSV while the program continues with other work.

    There are a few things to consider when working with threads and processes in Python:

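    • Global Interpreter Lock (GIL): Only one thread executes Python bytecode at a time, so threading rarely speeds up CPU-heavy tasks. It works well for I/O-bound tasks such as file writing, where a thread spends much of its time waiting. Note that to_csv also spends CPU time formatting values as text, so measure the actual gain on your data.

    • Use multiprocessing for heavy CPU tasks: If your other DataFrame tasks are CPU-intensive, multiprocessing is a better choice than threading, because each process has its own interpreter and its own GIL. The DataFrame has to be pickled and sent to the worker process, which adds some overhead.

    • Thread safety: Make sure no other thread modifies the DataFrame while it is being written to CSV.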
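One way to satisfy the thread-safety point is to hand the writer thread its own copy of the data. The following is a minimal sketch under a few assumptions: the helper name save_snapshot_in_background and the temporary output path are made up for illustration, and the extra copy costs memory, so it only makes sense when the main thread really does keep modifying the same DataFrame. In the question's case, where the save is the last use of df_unselected, the copy is not needed.

```python
import os
import tempfile
import threading

import pandas as pd


def save_snapshot_in_background(df: pd.DataFrame, path: str) -> threading.Thread:
    """Start a thread that writes a private copy of df to path (hypothetical helper)."""
    snapshot = df.copy()  # deep copy: later changes to df do not reach the snapshot
    thread = threading.Thread(
        target=snapshot.to_csv, args=(path,), kwargs={"index": False}
    )
    thread.start()
    return thread


df = pd.DataFrame({"col1": range(10), "col2": [x**2 for x in range(10)]})
out_path = os.path.join(tempfile.mkdtemp(), "snapshot.csv")  # illustrative location

writer = save_snapshot_in_background(df, out_path)
df["col2"] = 0  # the main thread can now modify df without affecting the file
writer.join()   # wait for the write before reading the file back

saved = pd.read_csv(out_path)
```

After the join, saved holds the values from the moment the snapshot was taken, not the zeros written afterwards.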

    # Runnable example: save df_unselected in a background thread
    
    import pandas as pd
    import threading
    
    def save_to_csv(df, filename):
        df.to_csv(filename, index=False)
    
    df = pd.DataFrame({"col1": [x for x in range(10000)], "col2": [x**2 for x in range(10000)]})
    
    df_selected = df[df["col1"] % 3 == 0]
    df_unselected = df[df["col1"] % 3 != 0]
    
    # Initiating a thread to save a portion of DataFrame
    thread = threading.Thread(target=save_to_csv, args=(df_unselected, 'unselected_rows.csv'))
    thread.start()
    
    # Continue other tasks with the main thread
    # additional_operations(df_selected)
    
    # Wait for the save to finish before anything reads the file
    thread.join()
    

    The save_to_csv function runs in a separate thread, so your program can process df_selected while df_unselected is saved in the background.
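A bare threading.Thread does not report failures back to the main thread: if to_csv raises (for example because the path is not writable), the exception is only printed from the worker. As an alternative sketch, not the only way to do it, the standard library's concurrent.futures returns a Future whose result() re-raises such exceptions. The temporary output path and the sum used as a stand-in for the main-thread work are assumptions for illustration.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

import pandas as pd


def save_to_csv(df: pd.DataFrame, filename: str) -> str:
    """Write df to filename and return the path that was written."""
    df.to_csv(filename, index=False)
    return filename


df = pd.DataFrame({"col1": range(10000), "col2": [x**2 for x in range(10000)]})
mask = df["col1"] % 3 == 0
df_selected, df_unselected = df[mask], df[~mask]

out_path = os.path.join(tempfile.mkdtemp(), "unselected_rows.csv")  # illustrative

with ThreadPoolExecutor(max_workers=1) as executor:
    # Schedule the save on the worker thread and continue immediately.
    future = executor.submit(save_to_csv, df_unselected, out_path)

    # Main-thread work overlaps with the save; a sum stands in for it here.
    selected_total = int(df_selected["col2"].sum())

    # result() blocks until the save finishes and re-raises any exception
    # that occurred in the worker thread.
    written_path = future.result()
```

If profiling shows the save is dominated by CPU work rather than disk I/O, ProcessPoolExecutor offers the same submit/result interface. In that case the DataFrame is pickled and sent to the worker process, and the script's top-level code should sit under an `if __name__ == "__main__":` guard.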