I am using the wikipedia Python package to scrape data about a particular topic:
q=['NASA', 'NASA_insignia', 'NASA_spinoff_technologies', 'NASA_facilities', 'NASA_Pathfinder', 'List_of_NASA_missions', 'Langley_Research_Center', 'NASA-TLX', 'Budget_of_NASA', 'NASA_(disambiguation)']
In the example above, I've searched for NASA. Now I need to obtain the summary for each element in the list.
import wikipedia

ny = []
for title in q:
    y = wikipedia.page(title)  # fetch the page for this title
    ny.append(y.summary)
Traversing each element of the list and retrieving its summary like this takes almost 40-60 seconds in total (even with a good network connection).
I don't know much about multiprocessing/multithreading. How can I speed up the execution considerably? Any help will be appreciated.
You can use a processing pool (see the multiprocessing.Pool documentation).
Here is an example based on your code:
from multiprocessing import Pool

import wikipedia

q = ['NASA', 'NASA_insignia', 'NASA_spinoff_technologies', 'NASA_facilities', 'NASA_Pathfinder',
     'List_of_NASA_missions', 'Langley_Research_Center', 'NASA-TLX', 'Budget_of_NASA', 'NASA_(disambiguation)']

def f(q_i):
    # runs in a worker process; fetches one page and returns its summary
    return wikipedia.page(q_i).summary

if __name__ == '__main__':  # required for multiprocessing on platforms that spawn workers
    with Pool(5) as p:
        ny = p.map(f, q)
Basically, f is applied to each element of q in a separate process. You can choose the number of worker processes when creating the pool (5 in my example).
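Since the work here is network-bound rather than CPU-bound, threads are an equally good fit and avoid the cost of spawning processes and pickling results between them. Here is a minimal alternative sketch using the standard-library concurrent.futures module, reusing the q list from above (the pool size of 5 is again an arbitrary choice):

from concurrent.futures import ThreadPoolExecutor

import wikipedia

def f(q_i):
    # each thread blocks on network I/O, so the requests overlap in time
    return wikipedia.page(q_i).summary

with ThreadPoolExecutor(max_workers=5) as ex:
    ny = list(ex.map(f, q))  # results come back in the same order as q

Unlike the process pool, this version needs no __main__ guard and can be pasted straight into an interactive session.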