I want to build a crawler that can update hundreds of thousands of links within several minutes. Are there any mature approaches to the scheduling? Is a distributed system needed? What is the biggest barrier limiting performance? Thanks.
For Python, you could go with Frontera by Scrapinghub:
https://github.com/scrapinghub/frontera
They're the same people who make Scrapy.
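If it helps, here's a rough sketch of what plugging Frontera into a Scrapy project's settings can look like, based on Frontera's Scrapy integration guide. Exact module paths and setting names can differ between Frontera versions, and `myproject` is just a hypothetical package name, so treat this as an outline rather than a drop-in config:

```python
# settings.py of a hypothetical Scrapy project that delegates scheduling to Frontera.

# Replace Scrapy's default scheduler so Frontera's frontier decides what to crawl next.
SCHEDULER = 'frontera.contrib.scrapy.schedulers.frontier.FronteraScheduler'

# Middlewares that shuttle requests/responses between Scrapy and the frontier.
SPIDER_MIDDLEWARES = {
    'frontera.contrib.scrapy.middlewares.schedulers.SchedulerSpiderMiddleware': 1000,
}
DOWNLOADER_MIDDLEWARES = {
    'frontera.contrib.scrapy.middlewares.schedulers.SchedulerDownloaderMiddleware': 1000,
}

# Separate module holding Frontera-specific settings (hypothetical path):
FRONTERA_SETTINGS = 'myproject.frontera_settings'

# --- myproject/frontera_settings.py (hypothetical) ---
# BACKEND = 'frontera.contrib.backends.memory.FIFO'  # in-memory; fine for testing only
# MAX_NEXT_REQUESTS = 256                             # batch size handed to Scrapy per cycle
```

For the scale you're describing, an in-memory backend won't be enough; Frontera also ships a distributed mode (Kafka message bus plus an HBase-backed frontier, as of recent versions) that's intended for broad, large-scale crawls.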
There's also Apache Nutch, which is a much older project: http://nutch.apache.org/