python multithreading screen-scraping urllib

Python Urllib UrlOpen Read

Say I am retrieving a list of URLs from a server using Python's urllib2 library. I noticed that it takes about 5 seconds to get one page, so it would take a long time to fetch all the pages I want to collect.

I suspect that most of those 5 seconds are spent on the server side, so I am wondering whether I could use the threading library. With, say, 5 threads, the average time per page could drop dramatically, maybe to 1 or 2 seconds per page (though it might make the server a bit busy). How can I choose the number of threads so that I get a decent speed without pushing the server too hard?

Thanks!

Update: I increased the number of threads one at a time and measured the total time (in minutes) needed to scrape 100 URLs. The total time dropped dramatically when going from 1 thread to 2 and kept decreasing as more threads were added, but the improvement from each additional thread became less and less noticeable. (The total time even rebounded once there were too many threads.) I know this is specific to the web server I am harvesting, but I decided to share it to show the power of threading, in the hope that it helps somebody one day.


Solution

There are a few things you can do. If the URLs are on different domains, then you might just fan out the work to threads, each downloading a page from a different domain.

If your URLs all point to the same server and you do not want to stress the server, then you can just retrieve the URLs sequentially. If the server is happy with a couple of parallel requests, then you can look into pools of workers. You could start, say, a pool of four workers and add all your URLs to a queue, from which the workers will pull new URLs.
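As a rough sketch of the worker-pool idea, assuming Python 3 (where urllib2 became urllib.request): the names fetch, scrape, and fetch_func are made up for this illustration, and the pool size of four is just the example figure from above, not a tuned value.

```python
import queue
import threading
import urllib.request


def fetch(url, timeout=10):
    """Download one page and return its body as bytes."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def scrape(urls, num_workers=4, fetch_func=fetch):
    """Fetch all URLs with a fixed pool of worker threads.

    Returns a dict mapping each URL to its body, or to the exception
    that was raised while fetching it.
    """
    # Fill the queue up front; workers only ever take items out.
    tasks = queue.Queue()
    for url in urls:
        tasks.put(url)

    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            try:
                url = tasks.get_nowait()
            except queue.Empty:
                return  # queue drained, this worker is done
            try:
                body = fetch_func(url)
            except Exception as exc:  # keep going if one page fails
                body = exc
            with lock:
                results[url] = body

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results
```

Calling scrape(list_of_urls, num_workers=4) would then return once every URL has been attempted. The num_workers argument is the knob to turn down if the server starts to struggle.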

Since you tagged the question with "screen-scraping" as well: Scrapy is a dedicated scraping framework that can fetch pages concurrently.

Python 3 (since 3.2) also ships with high-level concurrency primitives in the concurrent.futures module, including a ready-made thread pool.
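For illustration, the same pool could look roughly like this with concurrent.futures.ThreadPoolExecutor. This is a minimal sketch, not the only way to do it; fetch_page, fetch_all, and fetch_func are hypothetical names, and max_workers=4 is an arbitrary starting point.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
import urllib.request


def fetch_page(url, timeout=10):
    """Download one page and return its body as bytes."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def fetch_all(urls, max_workers=4, fetch_func=fetch_page):
    """Fetch URLs on a thread pool; map each URL to its body or its error."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Submit everything, remembering which future belongs to which URL.
        future_to_url = {executor.submit(fetch_func, url): url for url in urls}
        # Collect results in completion order, not submission order.
        for future in as_completed(future_to_url):
            url = future_to_url[future]
            try:
                results[url] = future.result()
            except Exception as exc:  # record the failure and carry on
                results[url] = exc
    return results
```

The executor handles the queueing and thread lifecycle itself, so max_workers is the only thing to adjust when looking for a rate the server tolerates.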