Tags: python, for-loop, python-requests, iteration, threadpool

concurrent.futures ThreadPoolExecutor not finishing pd.DataFrame.append


When working with Python's ThreadPoolExecutor and iterating through a list while performing some network requests, I'm facing an issue: my workers don't seem to finish before the task is marked as complete.

If I perform the same task with a for loop and with ThreadPoolExecutor, the length of my DataFrame varies between runs with ThreadPoolExecutor, while the for loop always performs all tasks.

Is there a problem with my code, or is there something I need to add for ThreadPoolExecutor to work properly?

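import concurrent.futures
import time

import pandas as pd

columns = ['name']
data = pd.DataFrame(columns=columns)
persons = ['Tom', 'Mike', 'Susan', 'David', 'Ellen']


def update(person):
    global data
    time.sleep(0.2)  # stands in for a network request
    # Note: DataFrame.append was removed in pandas 2.0, so this needs pandas < 2.0.
    data = data.append(pd.DataFrame({'name': person}, index=[person]))


# Sequential version: always ends with 5 rows.
for x in persons:
    update(x)
print(len(data))

# Reset the shared DataFrame before the threaded run.
data = pd.DataFrame(columns=columns)

# Threaded version: the resulting length varies between runs.
with concurrent.futures.ThreadPoolExecutor() as executor:
    executor.map(update, persons)
print(len(data))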

Solution

  • From the documentation:

    As of pandas 0.11, pandas is not 100% thread safe. The known issues relate to the copy() method. If you are doing a lot of copying of DataFrame objects shared among threads, we recommend holding locks inside the threads where the data copying occurs.
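
To make the advice about locks concrete, here is a minimal sketch of two possible fixes. Note that the `with` block already waits for all workers to finish, so the missing rows most likely come from the unsynchronized read-modify-write on the global `data`: two threads can read the same old DataFrame, and whichever assigns last overwrites the other's row. The names `data_lock`, `update_locked` and `fetch` are made up for this illustration and are not in the original post, and `pd.concat` is swapped in for `DataFrame.append` so the sketch also runs on pandas 2.x.

```python
import concurrent.futures
import threading
import time

import pandas as pd

persons = ['Tom', 'Mike', 'Susan', 'David', 'Ellen']

# Option 1: keep the shared DataFrame, but guard every update with a lock.
data = pd.DataFrame(columns=['name'])
data_lock = threading.Lock()  # hypothetical name, not from the question


def update_locked(person):
    global data
    time.sleep(0.2)  # simulated network request, done outside the lock
    row = pd.DataFrame({'name': person}, index=[person])
    with data_lock:
        # Read, extend and rebind `data` as one step so no update is lost.
        data = pd.concat([data, row])


with concurrent.futures.ThreadPoolExecutor() as executor:
    # list() consumes the results, so an exception raised in a worker
    # is re-raised here instead of passing unnoticed.
    list(executor.map(update_locked, persons))
print(len(data))


# Option 2: no shared state. Workers return values and the DataFrame
# is built once in the main thread.
def fetch(person):
    time.sleep(0.2)  # simulated network request
    return person


with concurrent.futures.ThreadPoolExecutor() as executor:
    names = list(executor.map(fetch, persons))  # map preserves input order
result = pd.DataFrame({'name': names}, index=names)
print(len(result))
```

With either option, both runs should report all five rows. The second option is usually simpler to reason about, since only one thread ever touches the DataFrame; with the first, the row order depends on which worker acquires the lock first.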