multithreading, python-3.x, web-scraping, multiprocessing, runtime-compilation

Python - using multiprocessing / multithreading for web scraping


I am using the wikipedia Python package to scrape data on a particular topic.

q=['NASA', 'NASA_insignia', 'NASA_spinoff_technologies', 'NASA_facilities', 'NASA_Pathfinder', 'List_of_NASA_missions', 'Langley_Research_Center', 'NASA-TLX', 'Budget_of_NASA', 'NASA_(disambiguation)']

In the example above, I've searched for NASA. Now I need to obtain the summary for each element in the list.

import wikipedia

ny = []
for title in q:
    # Each page is fetched with a blocking network request, one at a time.
    page = wikipedia.page(title)
    ny.append(page.summary)

This whole process, i.e. traversing each element of the list and retrieving its summary, takes almost 40-60 seconds to complete (even with a good network connection).

I don't know much about multiprocessing / multithreading. How can I speed up the execution considerably? Any help will be appreciated.


Solution

  • You can use a process pool (see the multiprocessing.Pool documentation).

    Here is an example based on your code:

    from multiprocessing import Pool

    import wikipedia


    q = ['NASA', 'NASA_insignia', 'NASA_spinoff_technologies', 'NASA_facilities', 'NASA_Pathfinder',
         'List_of_NASA_missions', 'Langley_Research_Center', 'NASA-TLX', 'Budget_of_NASA', 'NASA_(disambiguation)']

    def f(q_i):
        # Fetch the page in a worker process and return only its summary.
        y = wikipedia.page(q_i)
        return y.summary

    if __name__ == '__main__':
        # Guard so that worker processes don't re-run this block when they start.
        with Pool(5) as p:
            ny = p.map(f, q)


    Basically, f is applied to each element of q in separate worker processes, and the results come back in the same order as q. You set the number of processes when creating the pool (5 in my example).
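
  • Alternatively, since the bottleneck here is network I/O rather than CPU work, a thread pool can give a similar speed-up without the process start-up cost or the __main__ guard. Below is a sketch based on the same code; the max_workers value of 5 and the fetch_summary name are just illustrative choices, not anything required by the wikipedia package.

    from concurrent.futures import ThreadPoolExecutor

    import wikipedia

    q = ['NASA', 'NASA_insignia', 'NASA_spinoff_technologies', 'NASA_facilities', 'NASA_Pathfinder',
         'List_of_NASA_missions', 'Langley_Research_Center', 'NASA-TLX', 'Budget_of_NASA', 'NASA_(disambiguation)']

    def fetch_summary(title):
        # Each call blocks on a network request, so threads can overlap the waiting.
        return wikipedia.page(title).summary

    with ThreadPoolExecutor(max_workers=5) as executor:
        ny = list(executor.map(fetch_summary, q))

    The threads share memory with the main program, so the results can be collected directly, again in the same order as q.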