
In Python, how to implement a multiprocessing Pool?


In this part of my scraping code, I fetch a lot of URLs stored in a (url.xml) file, and it takes too long to finish. How can I implement a multiprocessing Pool here?

Is there any simple code to fix this problem? Thanks.

from bs4 import BeautifulSoup as soup
import requests
from multiprocessing import Pool

p = Pool(10) # “10” means that 10 URLs will be processed at the same time
p.map

page_url = "url.xml"


out_filename = "prices.csv"
headers = "availableOffers,otherpricess,currentprice \n"

with open(out_filename, "w") as fw:
  fw.write(headers)
  with open("url.xml", "r") as fr:
    for url in map(lambda x: x.strip(), fr.readlines()): 
      print(url)
      response = requests.get(url)
      page_soup = soup(response.text, "html.parser")


      availableOffers = page_soup.find("input", {"id": "availableOffers"})
      otherpricess = page_soup.find("span", {"class": "price"})
      currentprice = page_soup.find("div", {"class": "is"})

      fw.write(availableOffers + ", " + otherpricess + ", " + currentprice + "\n")


p.terminate()
p.join()

Solution

  • You can use the concurrent.futures standard package in Python for both multiprocessing and multi-threading.

    In your case, you don't need multiprocessing; multi-threading will help, because your function is not computationally expensive. It spends most of its time waiting for network responses, so threads can overlap that waiting.

    By using multi-threading, you can send multiple requests at the same time. The number_of_threads argument controls how many requests are sent at a time.

    I have created a function, extract_data_from_url_func, that extracts the data from a single URL, and I pass this function and the list of URLs to the multi-threading executor using concurrent.futures.

    from bs4 import BeautifulSoup as soup
    from concurrent.futures import ThreadPoolExecutor
    import requests
    
    page_url = "url.xml"
    number_of_threads = 6
    out_filename = "prices.csv"
    headers = "availableOffers,otherpricess,currentprice \n"
    
    def extract_data_from_url_func(url):
        print(url)
        # fetch the page and parse it with BeautifulSoup
        response = requests.get(url)
        page_soup = soup(response.text, "html.parser")
        # pull the three fields out of the parsed HTML
        availableOffers = page_soup.find("input", {"id": "availableOffers"})["value"]
        otherpricess = page_soup.find("span", {"class": "price"}).text.replace("$", "")
        currentprice = page_soup.find("div", {"class": "is"}).text.strip().replace("$", "")
        output_list = [availableOffers, otherpricess, currentprice]
        output = ",".join(output_list)
        print(output)
        return output
    
    with open("url.xml", "r") as fr:
        URLS = list(map(lambda x: x.strip(), fr.readlines()))
    
    with ThreadPoolExecutor(max_workers=number_of_threads) as executor:
        # run extract_data_from_url_func over all URLs using the thread pool;
        # map() yields the results in the same order as the input URLs
        responses = list(executor.map(extract_data_from_url_func, URLS))
    
    
    with open(out_filename, "w") as fw:
        fw.write(headers)
        for response in responses:
            # each result is a comma-separated row; add the newline here
            fw.write(response + "\n")
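
    Note that executor.map returns the results in the same order as the input URLs, so the rows in prices.csv line up with the lines in url.xml.

    If you still want a process-based pool, as the question originally asked, multiprocessing.Pool can be used with the same worker function. The following is only a rough sketch, assuming extract_data_from_url_func, out_filename and headers are defined at module top level as above (the function must be picklable so it can be sent to the worker processes); for a network-bound scraper like this one, the thread pool above should perform just as well.

    from multiprocessing import Pool

    # the guard prevents worker processes from re-running this block when they import the module
    if __name__ == "__main__":
        with open("url.xml", "r") as fr:
            URLS = [line.strip() for line in fr]

        # 10 worker processes; Pool.map splits the URL list across them
        with Pool(10) as p:
            responses = p.map(extract_data_from_url_func, URLS)

        with open(out_filename, "w") as fw:
            fw.write(headers)
            for response in responses:
                fw.write(response + "\n")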
    

    reference: https://docs.python.org/3/library/concurrent.futures.html