python, web-scraping, concurrent.futures

Concurrent futures webscraping


I am currently trying to develop a fast webscraping function so I can scrape a large list of files.

This is the code I have currently:

import time
import requests
from bs4 import BeautifulSoup
from concurrent.futures import ProcessPoolExecutor, as_completed


def parse(url):
    r = requests.get(url)
    soup = BeautifulSoup(r.content, 'lxml')
    return soup.find_all('a')


with ProcessPoolExecutor(max_workers=4) as executor:
    start = time.time()
    futures = [executor.submit(parse, url) for url in URLs]
    results = []
    for result in as_completed(futures):
        results.append(result)
    end = time.time()
    print("Time Taken: {:.6f}s".format(end - start))

This brings back results for websites, e.g. www.google.com. However, my problem is that I have no idea how to view the data it brings back; I only get Future objects.

Can someone please explain or show me how to do this?

I appreciate any time you give to help me with this.


Solution

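  • as_completed yields Future objects, not return values. Call result() on each completed Future to get what parse returned (it re-raises any exception parse raised). You can also build the futures with a dict comprehension so that each Future maps back to its URL, like below.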

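    with ProcessPoolExecutor(max_workers=4) as executor:
        start = time.time()
        # Map each Future back to the URL it was submitted for.
        futures = {executor.submit(parse, url): url for url in URLs}
        # as_completed yields each Future as soon as it finishes.
        for future in as_completed(futures):
            link = futures[future]
            try:
                # result() returns parse's value or re-raises its exception.
                data = future.result()
            except Exception as e:
                print("Link: {}, error: {}".format(link, e))
            else:
                print("Link: {}, data: {}".format(link, data))
        end = time.time()
        print("Time Taken: {:.6f}s".format(end - start))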
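
For anyone who wants to try the pattern without hitting the network, here is a minimal, self-contained sketch. It rests on a few assumptions: fake_parse and fetch_all are hypothetical names that do not come from the question, fake_parse only stands in for parse, and ThreadPoolExecutor is swapped in for ProcessPoolExecutor (both share the same submit/as_completed interface) so the snippet runs without an `if __name__ == "__main__":` guard.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fake_parse(url):
    """Hypothetical stand-in for parse(): does no network I/O."""
    if not url.startswith("http"):
        raise ValueError("unsupported URL: " + url)
    return len(url)


def fetch_all(urls, worker, max_workers=4):
    """Run worker(url) for every URL; return (results, errors) keyed by URL."""
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Map each Future back to the URL it was submitted for.
        futures = {executor.submit(worker, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                # result() unwraps the Future: it returns the worker's
                # return value or re-raises the exception it raised.
                results[url] = future.result()
            except Exception as exc:
                errors[url] = exc
    return results, errors
```

Calling fetch_all(["http://a.example", "ftp://b.example"], fake_parse) should put the first URL in results and the second in errors. Collecting into dicts rather than printing keeps the data around after the executor has shut down, which is roughly what the results list in the question was meant to do.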