Search code examples
pythonmultithreadingpandasstring-to-datetime

Python Multithreading Implementation of to_datetime Function


I need to implement multithreading to a Python job.

I have a dictionary and each key in that dictionary (out of about 40) is a timestamped pandas dataframe. Most of the dataframes have 100,000+ rows. Their timestamps are strings in "%Y-%m-%d %H:%M:%S" format.

To convert the timestamped strings I use the following function:

def to_dt(df):
    df['timestamp'] = df['timestamp'].map(lambda n: pd.to_datetime(n, format='%Y-%m-%d %H:%M:%S'))
    return df

So I would like to put each process to_dt(df) in a separate thread. How can I do that?

To simplify let's consider we have the following setup:

def to_dt(df):
    df['timestamp'] = df['timestamp'].map(lambda n: pd.to_datetime(n, format='%Y-%m-%d %H:%M:%S'))
    return df

# empty dictionary
d_test = {}

# dataframe with single string timestamp column
df = pd.DataFrame(columns=['st_dt'])

# populate dataframe with 1000 timestamp rows
for i in range(1000):
    df.loc[len(df)] = ['2018-10-02 10:00:00']

# add 20 instances of the dataframe to the dictionary with keys in format "a0" to 'a19'
for i in range(20):
    d_test['a'+str(i)] = df

Now how can we make each iteration of

for i in range(20): to_dt(d_test['a'+str(i)])

to run in a separate thread?


Solution

  • Due to the existence of GIL, one and only one thread is running at any time in Python, so multithreading in this case would only make the performance worse.

    In order to use multiple cores, you need multiprocessing instead of multithreading, but the heavy overhead of spawning a new process will surely overtake the benefit, so it's better to use a single pd.to_datetime in your case.

    Also this post explains GIL quite well.