I am trying to use concurrent.futures to run a function across multiple threads and speed up my code.
I have read the documentation and this guide, but I believe I may not be doing this correctly. This MRE should allow us to test a number of different string lengths and list sizes to compare performance:
import pandas as pd, tqdm, string, random
from thefuzz import fuzz, process
from concurrent.futures import ThreadPoolExecutor

def generate_string(items=10, lengths=5):
    return [''.join(random.choice(string.ascii_letters) for i in range(lengths))] * items

def matching(a, b):
    matches = {}
    scorers = {'token_sort_ratio': fuzz.token_sort_ratio, 'token_set_ratio': fuzz.token_set_ratio, 'partial_token_sort_ratio': fuzz.partial_token_sort_ratio,
               'Quick': fuzz.QRatio, 'Unicode Quick': fuzz.UQRatio, 'Weighted': fuzz.WRatio, 'Unweighted': fuzz.UWRatio}
    for x in tqdm.tqdm(a):
        best = 0
        for _, scorer in scorers.items():
            res = process.extractOne(x, b, scorer=scorer)
            if res[1] > best:
                best = res[1]
                matches[x] = res
            else:
                continue
    return matches

list_a = generate_string(100, 10)
list_b = generate_string(10, 5)

with ThreadPoolExecutor(max_workers=5) as executor:
    future = executor.submit(matching, list_a, list_b)
This code runs with no error; how can I use multiple workers to execute these loops in parallel so that the code will run faster?
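For context, a single executor.submit call hands the whole matching call to one worker, so the remaining threads have nothing to do. Below is a minimal sketch of one way to spread the work, assuming the items of list_a can be scored independently of each other. The names random_strings, best_match, match_chunk, chunked, and threaded_matching are hypothetical, and difflib's SequenceMatcher stands in for thefuzz so the snippet runs on its own. On standard CPython, threads may not speed up CPU-bound scoring because of the GIL, so this illustrates the pattern rather than a guaranteed speedup:

```python
import random
import string
from concurrent.futures import ThreadPoolExecutor
from difflib import SequenceMatcher


def random_strings(items=10, length=5):
    # Hypothetical helper: builds a fresh random string for every item.
    return [''.join(random.choice(string.ascii_letters) for _ in range(length))
            for _ in range(items)]


def best_match(x, choices):
    # Stand-in scorer (assumption): difflib ratio scaled to 0-100, not thefuzz.
    scored = ((c, round(100 * SequenceMatcher(None, x, c).ratio())) for c in choices)
    return max(scored, key=lambda pair: pair[1])


def match_chunk(chunk, choices):
    # Score one slice of the input; this is the unit of work given to a thread.
    return {x: best_match(x, choices) for x in chunk}


def chunked(seq, n):
    # Split seq into at most n consecutive slices of near-equal size.
    size = max(1, -(-len(seq) // n))  # ceiling division, never zero
    return [seq[i:i + size] for i in range(0, len(seq), size)]


def threaded_matching(a, b, workers=5):
    matches = {}
    with ThreadPoolExecutor(max_workers=workers) as executor:
        # One submit per chunk, so every worker can receive a task.
        futures = [executor.submit(match_chunk, chunk, b)
                   for chunk in chunked(a, workers)]
        for future in futures:
            matches.update(future.result())
    return matches


if __name__ == '__main__':
    list_a = random_strings(100, 10)
    list_b = random_strings(10, 5)
    print(len(threaded_matching(list_a, list_b)))
```

Calling future.result() is also what surfaces exceptions raised inside a worker; without it, a failing task can appear to run "with no error".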
Thanks to a hint from @Anentropic, I was able to make the following change using multiprocessing:
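import os
from multiprocessing import Pool

if __name__ == '__main__':
    list_a = generate_string(500, 10)
    list_b = generate_string(500, 10)
    # Use all but two of the available CPU cores
    pool = Pool(os.cpu_count() - 2)
    # Parallel run: each call to matching receives one tuple from zip
    res = pool.map(matching, zip(list_a, list_b))
    norm_res = matching([list_a, list_b])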
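Note that zip(list_a, list_b) pairs the two lists item by item, so each task sees only one element from each list. If the goal is still to compare every item of list_a against all of list_b, one possible sketch is to split list_a into chunks and bind list_b with functools.partial. The names match_chunk, chunked, and parallel_matching are hypothetical, and difflib's SequenceMatcher again stands in for thefuzz so the snippet is self-contained:

```python
import os
from difflib import SequenceMatcher
from functools import partial
from multiprocessing import Pool


def match_chunk(chunk, choices):
    # Stand-in for matching() (assumption): best difflib score (0-100) per item.
    def score(x, c):
        return round(100 * SequenceMatcher(None, x, c).ratio())
    return {x: max(((c, score(x, c)) for c in choices), key=lambda p: p[1])
            for x in chunk}


def chunked(seq, n):
    # Split seq into at most n consecutive slices of near-equal size.
    size = max(1, -(-len(seq) // n))  # ceiling division, never zero
    return [seq[i:i + size] for i in range(0, len(seq), size)]


def parallel_matching(a, b, workers):
    # Every task gets one chunk of a plus the full list b.
    worker = partial(match_chunk, choices=b)
    with Pool(workers) as pool:
        partial_results = pool.map(worker, chunked(a, workers))
    merged = {}
    for part in partial_results:
        merged.update(part)
    return merged


if __name__ == '__main__':
    list_a = ['apple pie', 'banana split', 'cherry tart', 'lemon curd']
    list_b = ['apple', 'banana', 'cherry']
    # os.cpu_count() can return None, and small machines have fewer than 3 cores.
    workers = max(1, (os.cpu_count() or 1) - 2)
    print(parallel_matching(list_a, list_b, workers))
```

The worker function is defined at module level so that it can be pickled and sent to the child processes, and the Pool is created under the __main__ guard, as in the snippet above.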