Tags: python-3.x, web-scraping, web-crawler, python-multiprocessing, python-multithreading

Use multiprocessing or multithreading to improve scraping speed in Python


I have the following crawler code:

import requests
import json
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
import re
from datetime import datetime

def crawl(id):
    try:
        # Build the project page URL from the zero-padded id and fetch it
        url = 'https://www.china0001.com.cn/project/{0:06d}.html'.format(id)
        print(url)
        content = requests.get(url).text
        soup = BeautifulSoup(content, 'lxml')
        # Locate the results table and collect the text of every cell, row by row
        tbody = soup.find("table", attrs={"id": "mse_new"}).find("tbody", attrs={"class": "jg"})
        tr = tbody.find_all("tr")
        rows = []
        for i in tr[1:]:
            rows.append([j.text.strip() for j in i.find_all("td")])
        # Each cell holds a "key:value" pair; split on the colon to build a dict
        out = dict([map(str.strip, y.split(':')) for x in rows for y in x])
        return out

    except AttributeError:
        # find() returned None, i.e. the table is missing for this id
        return False

data = list()
for id in range(699998, 700010):
    print(id)
    res = crawl(id)
    if res:
        data.append(res)

if len(data) > 0:
    df = pd.DataFrame(data)
    df.to_excel('test.xlsx', index = False)

It works, but it takes a very long time when I make the range interval larger, so I think I may need to use multithreading or multiprocessing, or split range() into multiple blocks, to improve the scraping speed, but I don't know how to do that.

Can anyone help? Thanks a lot in advance.
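
(To clarify what I mean by splitting range() into blocks, I imagine something roughly like the sketch below; the block size of 500 is just an arbitrary example, and on its own this doesn't make anything faster unless each block is then handed to a separate worker.)

def chunked(seq, size):
    # yield consecutive slices of seq, each at most `size` ids long
    for i in range(0, len(seq), size):
        yield seq[i:i + size]

blocks = list(chunked(list(range(699998, 700010)), 500))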

Updates:

from concurrent import futures

MAX_WORKERS = 20  # play with it to get an optimal value
ids = list(range(699998, 700050))
workers = min(MAX_WORKERS, len(ids))

data = list()
with futures.ThreadPoolExecutor(workers) as executor:
    # map() runs crawl concurrently across the worker threads and
    # yields the results in the order of ids
    for res in executor.map(crawl, ids):
        if res:
            data.append(res)
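
With the results collected this way, the export step stays the same as in the original script (repeated here for completeness):

if len(data) > 0:
    df = pd.DataFrame(data)
    df.to_excel('test.xlsx', index=False)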

Solution

  • Here is how I would go about it:

    from concurrent import futures

    MAX_WORKERS = 20  # play with it to get an optimal value
    ids = list(range(0, 10000))
    workers = min(MAX_WORKERS, len(ids))
    data = list()

    with futures.ThreadPoolExecutor(workers) as executor:
        # executor.map returns an iterator of results; keep only the
        # successful ones (crawl returns False on failure)
        data.extend(res for res in executor.map(crawl, ids) if res)


    I haven't tested it, though, but hopefully it gives you a direction to explore.
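
    Since the question also mentions multiprocessing: for an I/O-bound job like HTTP requests, threads are usually the better fit, but swapping in a process pool is almost a drop-in change. Here is a minimal, untested sketch, assuming the same crawl function is defined at module level so it can be pickled (the worker count and id range are arbitrary examples):

    from concurrent import futures
    import pandas as pd

    ids = list(range(699998, 700010))  # example range from the question

    if __name__ == '__main__':
        # ProcessPoolExecutor exposes the same map() interface as ThreadPoolExecutor,
        # but runs crawl in separate worker processes; the __main__ guard lets those
        # processes import this module safely.
        with futures.ProcessPoolExecutor(max_workers=8) as executor:
            data = [res for res in executor.map(crawl, ids) if res]

        if len(data) > 0:
            pd.DataFrame(data).to_excel('test.xlsx', index=False)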