The following is a thread pool executor that I have implemented in Python 3.x:
with ThreadPoolExecutor(max_workers=15) as ex:
    f = open(filename, 'r', encoding='UTF-8')
    results = {ex.submit(callreadline, files): files for files in f.readlines()}
The results variable contains values in the following format:
[Image: words and their corresponding 200-dimensional embeddings]
You can see that the values are tuples. The first value is a word and the second value is a 200-dimensional array. There are 400,000 values in total, so there are 400,000 tuples.
Now what I want to do is create another thread pool executor that builds a mapping from each word to its index, like this:
{the:0,is:1,are:2,said:3,...…………….hello:399999}
I was using a for loop with the following code:
count = 0
word_to_idx = OrderedDict()
vectors = []
for future in results:
    b = future.result()
    word_to_idx[count] = b[0]
    if count == 0:
        vectors = np.array([b[1]])
    else:
        vectors = np.append(vectors, np.array([b[1]]), axis=0)
    count = count + 1
At the end of the above function I returned word_to_idx and vectors, which did the job. However, looping over 400,000 tuples and assigning them one by one took an extremely long time (about 10 hours).
So I was wondering whether there is a way to parallelize this step as well using a thread pool executor.
I was thinking of creating threads and sharing a counter variable among them, with each thread getting access to the shared variable one at a time. A thread would increment the variable, and then another thread would access the incremented counter. Could someone point me in the right direction?
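For reference, the shared-counter idea could be sketched roughly as follows. This is only an illustration under assumptions: `assign_index`, `counter_lock`, and the sample `words` list are hypothetical names and made-up data, not part of the code above.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

counter = 0                      # shared index counter
counter_lock = threading.Lock()  # guards both counter and word_to_idx
word_to_idx = {}

def assign_index(word):
    """Hypothetical worker: claim the next free index for `word`."""
    global counter
    with counter_lock:           # only one thread at a time gets past here
        word_to_idx[word] = counter
        counter += 1

words = ["the", "is", "are", "said", "hello"]  # made-up sample data
with ThreadPoolExecutor(max_workers=4) as ex:
    # Consume the iterator so any exception raised in a worker surfaces here.
    list(ex.map(assign_index, words))
```

Note that in this sketch the indices are handed out in whatever order the threads acquire the lock, so they are not guaranteed to follow the order of the words in the file, and the lock means the updates themselves still run one at a time.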
Edit:
Here is the callreadline function, which works really fast as it is called with 15 workers:
def callreadline(line):
    # word_to_idx = OrderedDict()
    word_to_idx = OrderedDict()
    vectors = []
    vocabulary = None
    word_to_idx = read_w2v_word(line.split(' ')[0])
    try:
        vectors = np.append(vectors, [np.array(line.split(' ')[1:])], axis=0)
    except:
        vectors = np.array(line.split(' ')[1:], dtype=float)
    if vocabulary is not None:
        word_to_idx, vectors = filter_words(word_to_idx, vectors, vocabulary)
    return word_to_idx, vectors
I have a feeling the callreadline function also isn't even close to being as fast as it could be, but it wasn't part of the question, so let me try to fix the rest for you:
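import numpy as np
from concurrent.futures import ThreadPoolExecutor

with ThreadPoolExecutor(max_workers=15) as ex:
    # Close the file once every line has been submitted.
    with open(filename, 'r', encoding='UTF-8') as f:
        # A list keeps the futures in the same order as the lines in the file.
        results = [ex.submit(callreadline, files) for files in f.readlines()]

word_to_idx = dict()
vectors = []
for count, future in enumerate(results):
    b = future.result()
    word_to_idx[b[0]] = count   # map word -> index
    vectors.append(b[1])        # cheap list append instead of np.append
vectors = np.array(vectors)     # convert to an array once, at the end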
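The main reason this should be faster is that np.append does not grow an array in place: it returns a new array, copying all existing rows on every call, so the loop does more and more work as the array grows. Appending to a Python list and converting once avoids that. A small self-contained sketch of the two patterns, using made-up three-dimensional vectors in place of the real 200-dimensional ones (`pairs` is a hypothetical stand-in for the tuples returned by the futures):

```python
import numpy as np

# Made-up stand-in for the (word, vector) tuples returned by the futures.
pairs = [
    ("the", [0.1, 0.2, 0.3]),
    ("is", [0.4, 0.5, 0.6]),
    ("are", [0.7, 0.8, 0.9]),
]

# Slow pattern: every np.append call allocates a new array and copies all rows.
slow = np.array([pairs[0][1]])
for _, vec in pairs[1:]:
    slow = np.append(slow, np.array([vec]), axis=0)

# Faster pattern: collect rows in a list and convert to an array once.
word_to_idx = {}
rows = []
for idx, (word, vec) in enumerate(pairs):
    word_to_idx[word] = idx
    rows.append(vec)
fast = np.array(rows)
```

Both patterns produce the same array; only the amount of copying differs.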