I have a pretty big pandas dataframe and I want to select some rows based on conditions.
The problem is that the act of saving as CSV is separate from the overall flow of the program and consumes quite a bit of time.
Is it possible to separate the threads so that the main thread progresses to the selected rows, while at the same time unselected rows are saved as csv in another thread?
such as...
# This is pseudo code
import pandas as pd
df = pd.DataFrame({"col1":[x for x in range(10000)], "col2":[x**2 for x in range(0, 10000)]})
df_selected = df[df.apply(lambda x: x.col1%3==0, axis=1)]
df_unselected = df[df.apply(lambda x: x.col1%3!=0, axis=1)]
def Other_thread_save_to_csv(df:pd.DataFrame):
# this function is the last function to use df_unselected .
Other_thread_save_to_csv(df_unselected )
all_other_hadlings(df_selected )
Yes, Python's either threading or multiprocessing features are handy for concurrent tasks like saving a DataFrame to CSV while doing other tasks.
There are things you need to consider while working with threads and multiprocessing in python:
Global Interpreter Lock (GIL) in Python: This means threading may not always speed up CPU-heavy tasks. But for I/O tasks (like file writing), it's quite good to use.
Use Multiprocessing for Heavy CPU Tasks: If your other DataFrame tasks are CPU-intensive, multiprocessing is a better choice than threading.
and the last would be the Thread Safety, You have to Make sure no other thread is altering the DataFrame when you're writing it to a CSV.
# This is pseudo code
import pandas as pd
import threading
def save_to_csv(df, filename):
df.to_csv(filename, index=False)
df = pd.DataFrame({"col1": [x for x in range(10000)], "col2": [x**2 for x in range(10000)]})
df_selected = df[df["col1"] % 3 == 0]
df_unselected = df[df["col1"] % 3 != 0]
# Initiating a thread to save a portion of DataFrame
thread = threading.Thread(target=save_to_csv, args=(df_unselected, 'unselected_rows.csv'))
thread.start()
# Continue other tasks with the main thread
# additional_operations(df_selected)
# Optionally, wait for the thread to complete
thread.join()
save_to_csv
function runs on a separate thread, allowing your program to process df_selected
while df_unselected
gets saved in the background.