Tags: python, multithreading, web-crawler, web-mining

Fast internet crawler


I'd like to perform data mining on a large scale. For this, I need a fast crawler. All I need is something that downloads a web page, extracts links, and follows them recursively, without visiting the same URL twice. Basically, I want to avoid looping.
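
To make the looping concern concrete: deduplication reduces to checking a set of already-seen URLs before fetching. A minimal single-threaded sketch (illustrative names, crude regex-based link extraction rather than a real HTML parser):

```python
from urllib.parse import urljoin
from urllib.request import urlopen
import re

visited = set()

def crawl(url, depth=2):
    # Skip anything we've already fetched -- this is what prevents loops.
    if url in visited or depth == 0:
        return
    visited.add(url)
    try:
        html = urlopen(url, timeout=5).read().decode("utf-8", "replace")
    except Exception:
        return
    # Crude link extraction; a production crawler would use an HTML parser.
    for link in re.findall(r'href="(http[^"]+)"', html):
        crawl(urljoin(url, link), depth - 1)

crawl("http://example.com")
```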

I already wrote a crawler in Python, but it's too slow: I'm not able to saturate a 100 Mbit line with it. Top speed is ~40 URLs/sec, and for some reason it's hard to do better. It seems to be a problem with Python's multithreading/sockets. I also ran into problems with Python's garbage collector, but those were solvable. CPU isn't the bottleneck, by the way.

So, what should I use to write a crawler that is as fast as possible, and what's the best way to avoid loops while crawling?

EDIT: The solution was to combine the multiprocessing and threading modules: spawn multiple processes with multiple threads per process. Spawning many threads in a single process is not effective, and multiple processes with a single thread each consume too much memory.
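
A rough sketch of that process/thread hybrid, under some assumptions not in the original: a shared JoinableQueue serves as the URL frontier, link extraction and dedup are omitted, and the worker/thread counts are illustrative:

```python
import threading
from multiprocessing import Process, JoinableQueue
from urllib.request import urlopen

NUM_PROCESSES = 4        # illustrative: tune to your machine
THREADS_PER_PROCESS = 16  # many threads per process hide network latency

def fetch_loop(queue):
    # Each thread pulls URLs off the shared frontier and downloads them.
    while True:
        url = queue.get()
        try:
            urlopen(url, timeout=5).read()  # parse / extract links here
        except Exception:
            pass
        finally:
            queue.task_done()

def worker(queue):
    # One process = many fetch threads sharing one interpreter.
    for _ in range(THREADS_PER_PROCESS):
        threading.Thread(target=fetch_loop, args=(queue,), daemon=True).start()
    queue.join()  # exit once every queued URL has been processed

if __name__ == "__main__":
    queue = JoinableQueue()
    for url in ["http://example.com"]:  # seed URLs
        queue.put(url)
    procs = [Process(target=worker, args=(queue,)) for _ in range(NUM_PROCESSES)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```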


Solution

  • It sounds like you have a design problem more than a language problem. Look into the multiprocessing module to fetch more sites in parallel instead of relying on threads alone. Also, consider keeping a table of previously visited sites (a database, perhaps) so you never fetch the same URL twice; a sketch of that idea follows.
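
One way to realize the "table of visited sites" suggestion (a sketch, not the only option) is a small SQLite table keyed by URL. `INSERT OR IGNORE` makes the check-and-mark step atomic, so several workers can share the table:

```python
import sqlite3

conn = sqlite3.connect("visited.db")
conn.execute("CREATE TABLE IF NOT EXISTS visited (url TEXT PRIMARY KEY)")

def mark_if_new(url):
    """Return True if the URL was unseen (and is now marked as visited)."""
    cur = conn.execute("INSERT OR IGNORE INTO visited (url) VALUES (?)", (url,))
    conn.commit()
    return cur.rowcount == 1  # 1 row inserted = new URL; 0 = already seen

print(mark_if_new("http://example.com"))  # True the first time
print(mark_if_new("http://example.com"))  # False afterwards
```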