I need to add multithreading to a Python job.
I have a dictionary, and each of its roughly 40 keys holds a timestamped pandas dataframe. Most of the dataframes have 100,000+ rows. Their timestamps are strings in the "%Y-%m-%d %H:%M:%S" format.
To convert the timestamped strings I use the following function:
import pandas as pd

def to_dt(df):
    df['timestamp'] = df['timestamp'].map(lambda n: pd.to_datetime(n, format='%Y-%m-%d %H:%M:%S'))
    return df
So I would like to run each to_dt(df) call in a separate thread. How can I do that?
To simplify let's consider we have the following setup:
def to_dt(df):
    df['timestamp'] = df['timestamp'].map(lambda n: pd.to_datetime(n, format='%Y-%m-%d %H:%M:%S'))
    return df

# empty dictionary
d_test = {}

# dataframe with a single string timestamp column
df = pd.DataFrame(columns=['timestamp'])

# populate the dataframe with 1000 timestamp rows
for i in range(1000):
    df.loc[len(df)] = ['2018-10-02 10:00:00']

# add 20 copies of the dataframe to the dictionary with keys 'a0' to 'a19'
for i in range(20):
    d_test['a' + str(i)] = df.copy()
Now how can we make each iteration of

for i in range(20):
    to_dt(d_test['a' + str(i)])

run in a separate thread?
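In case it helps, this is roughly what I have in mind, sketched with concurrent.futures.ThreadPoolExecutor (I am assuming that is a reasonable way to manage the threads, but I don't know whether it is the right approach):

```python
import concurrent.futures

import pandas as pd

def to_dt(df):
    df['timestamp'] = df['timestamp'].map(lambda n: pd.to_datetime(n, format='%Y-%m-%d %H:%M:%S'))
    return df

# same setup as above, built more directly
df = pd.DataFrame({'timestamp': ['2018-10-02 10:00:00'] * 1000})
d_test = {'a' + str(i): df.copy() for i in range(20)}

# submit each to_dt(df) call to its own worker thread
with concurrent.futures.ThreadPoolExecutor(max_workers=4) as pool:
    futures = {key: pool.submit(to_dt, frame) for key, frame in d_test.items()}
    d_test = {key: fut.result() for key, fut in futures.items()}
```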
Because of the GIL, only one thread can execute Python bytecode at any given time, so multithreading a CPU-bound job like this would only make performance worse.
To use multiple cores you need multiprocessing rather than multithreading, but the overhead of spawning processes and shipping the dataframes between them will likely outweigh any benefit. In your case it is better to make a single vectorized pd.to_datetime call per dataframe instead of mapping it over every row.
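For example, a sketch of the vectorized version of your to_dt (assuming the column is named timestamp, as in your setup):

```python
import pandas as pd

def to_dt(df):
    # one vectorized call over the whole column instead of a
    # Python-level .map that calls pd.to_datetime once per row
    df['timestamp'] = pd.to_datetime(df['timestamp'], format='%Y-%m-%d %H:%M:%S')
    return df

df = pd.DataFrame({'timestamp': ['2018-10-02 10:00:00'] * 1000})
df = to_dt(df)
```

A simple loop over the 20 dataframes with this version should already be substantially faster than the per-row conversion, with no threads or processes involved.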
Also, this post explains the GIL quite well.