I'm trying to download a long list of HTML files from the internet onto my computer, and then use BeautifulSoup to scrape those local files. It's a long story why I want to save them to my computer first before scraping, so I'll save you the trouble by not writing an essay!
Anyway, for me, the requests module is too slow when dealing with many URLs, so I decided to stick with urllib and use multiprocessing/thread pooling to run the request functions in parallel (so it's quicker than requesting each file one after another).
My problem is that I want to save each HTML/URL independently; that is, I want to store each HTML file separately instead of writing all of the HTMLs into one file. While multiprocessing and urllib can request HTMLs in parallel, I couldn't find out how to download (or save/write to txt) each HTML separately.
I'm imagining something like the general example I just made up below, where each request within the parallel function will get performed in parallel.
parallel(
request1
request2
request3
...
)
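In actual Python, I imagine it would look roughly like the sketch below, where request1 and request2 are just placeholder names for the per-URL download functions I would write:

from multiprocessing import Pool

def request1():
    # download URL 1 and save it to its own file
    pass

def request2():
    # download URL 2 and save it to its own file
    pass

def run(func):
    # small helper so the pool can call each request function
    return func()

if __name__ == '__main__':
    with Pool() as pool:
        pool.map(run, [request1, request2])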
The reason for wanting it to be like this is so that I can use the same simple script structure for the next step: parsing the HTMLs with BeautifulSoup. Just as I have separate request functions for each URL in the first part, I'll need separate parse functions for each HTML, because each HTML's structure is different. If you have a different solution, that's okay as well; I'm just trying to explain my thinking, and it doesn't have to be like this.
Is it possible to do this (both requesting separately and parsing separately) using multiprocessing (or any other library)? I spent the whole day yesterday on StackOverflow trying to find similar questions, but many involve complex things like eventlet or scrapy, and none mention downloading each HTML into a separate file and parsing them individually, but in parallel.
It's possible for sure :) Just write a single-threaded function that does everything you need from start to finish (request, parse, save under a unique filename), and then execute it in a multiprocessing pool, e.g.:
from multiprocessing import Pool
from urllib.request import urlopen
from bs4 import BeautifulSoup

def my_function(url_to_parse):
    html = urlopen(url_to_parse).read()                   # request
    filename = url_to_parse.replace('/', '_') + '.html'
    with open(filename, 'wb') as f:                       # save with a unique filename
        f.write(html)
    soup = BeautifulSoup(html, 'html.parser')             # parse
    return soup.title.string                              # return result (optional)

NUM_OF_PROCS = 10
if __name__ == '__main__':
    pool = Pool(NUM_OF_PROCS)
    results = pool.map(my_function, list_of_urls_to_parse)  # your list of URLs
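If each page really needs its own parsing logic, one option (just a sketch; the URLs and the parse_site_a / parse_site_b functions below are made-up placeholders) is to map each URL to its own parse function and have the worker look up the right one:

from multiprocessing import Pool
from urllib.request import urlopen
from bs4 import BeautifulSoup

def parse_site_a(soup):
    # parsing logic specific to site A's structure
    return soup.find('h1').get_text()

def parse_site_b(soup):
    # parsing logic specific to site B's structure
    return soup.title.string

PARSERS = {
    'http://site-a.example/page.html': parse_site_a,   # example URLs only
    'http://site-b.example/page.html': parse_site_b,
}

def worker(url):
    html = urlopen(url).read()
    # save each page to its own file
    with open(url.replace('/', '_') + '.html', 'wb') as f:
        f.write(html)
    # pick the parse function registered for this URL
    return PARSERS[url](BeautifulSoup(html, 'html.parser'))

if __name__ == '__main__':
    with Pool(4) as pool:
        results = pool.map(worker, list(PARSERS))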