Tags: multithreading, asynchronous, beautifulsoup, multiprocessing, grequests

How to parse the response from Grequests faster?


I want to scrape multiple URLs and parse them as quickly as possible, but the for loop is not fast enough for me. Is there a way to do this, maybe with asyncio, multiprocessing, or multithreading?

import grequests
from bs4 import BeautifulSoup


links1 = []  # multiple links


while True:
    try:
        # fire off all requests concurrently, 25 at a time
        reqs = (grequests.get(link) for link in links1)
        resp = grequests.imap(reqs, size=25, stream=False)

        for r in resp:  # I WANT TO RUN THIS FOR LOOP AS QUICKLY AS POSSIBLE - IS IT POSSIBLE?
            soup = BeautifulSoup(r.text, 'lxml')
            parse = soup.find('div', class_='txt')
    except Exception:
        continue


Solution

  • Example of how to use multiprocessing with requests/BeautifulSoup:

    import requests
    from tqdm import tqdm  # for pretty progress bar
    from bs4 import BeautifulSoup
    from multiprocessing import Pool
    
    # some 1000 links to analyze
    links1 = [
        "https://en.wikipedia.org/wiki/2021_Moroccan_general_election",
        "https://en.wikipedia.org/wiki/Tangerang_prison_fire",
        "https://en.wikipedia.org/wiki/COVID-19_pandemic",
        "https://en.wikipedia.org/wiki/Yolanda_Fern%C3%A1ndez_de_Cofi%C3%B1o",
    ] * 250
    
    
    def parse(url):
        soup = BeautifulSoup(requests.get(url).content, "html.parser")
        return soup.select_one("h1").get_text(strip=True)
    
    
    if __name__ == "__main__":
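        # Pool() with no arguments starts os.cpu_count() worker processes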
        with Pool() as p:
            out = []
            for r in tqdm(p.imap(parse, links1), total=len(links1)):
                out.append(r)
    
        print(len(out))
    

    With my internet connection/CPU (Ryzen 3700x) I was able to get results from all 1000 links in 30 seconds:

    100%|██████████| 1000/1000 [00:30<00:00, 33.12it/s]
    1000
    

    All my CPU cores were utilized (screenshot from htop).

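  • Since the question also asks about multithreading: the downloads are
    I/O-bound, so a thread pool works for that part as well (the
    BeautifulSoup parsing itself is CPU-bound and runs under the GIL, which
    is why the process pool above tends to win when parsing dominates). A
    minimal sketch using concurrent.futures.ThreadPoolExecutor; the
    max_workers=25 value is an assumption mirroring the size=25 from the
    question's grequests.imap call:

    import requests
    from bs4 import BeautifulSoup
    from concurrent.futures import ThreadPoolExecutor

    # same 1000 links as in the multiprocessing example above
    links1 = [
        "https://en.wikipedia.org/wiki/2021_Moroccan_general_election",
        "https://en.wikipedia.org/wiki/Tangerang_prison_fire",
        "https://en.wikipedia.org/wiki/COVID-19_pandemic",
        "https://en.wikipedia.org/wiki/Yolanda_Fern%C3%A1ndez_de_Cofi%C3%B1o",
    ] * 250


    def parse(url):
        soup = BeautifulSoup(requests.get(url).content, "html.parser")
        return soup.select_one("h1").get_text(strip=True)


    if __name__ == "__main__":
        # max_workers=25 is an assumed value, chosen to match size=25
        # from the question; tune it to your connection
        with ThreadPoolExecutor(max_workers=25) as executor:
            out = list(executor.map(parse, links1))

        print(len(out))

    executor.map keeps the results in the same order as links1, just like
    Pool.imap does in the example above, so the two variants are drop-in
    alternatives for each other.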