I'm looking for guidance on how many resources (mainly CPU and RAM) I should dedicate to my crawler to crawl ~1M pages per hour smoothly. I run everything on a single node and use ES for persistence. I do a recursive crawl across ~1M domains. Thank you!
In general (see the FAQ), crawl speed depends to a large extent on the diversity of hostnames and on the politeness settings. In your case there is no shortage of hostnames, so that won't be the limiting factor.
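For reference, here is a minimal sketch of the politeness-related settings in a StormCrawler configuration; the keys are the usual crawler-conf.yaml ones and the values are purely illustrative, not a recommendation for your setup:

  # politeness / fetching settings (illustrative values)
  fetcher.server.delay: 1.0        # seconds between requests to the same queue
  fetcher.threads.per.queue: 1     # concurrent fetches per host queue
  fetcher.threads.number: 50       # total fetcher threads in the topology
  partition.url.mode: "byHost"     # queues keyed by hostname

With ~1M distinct hosts, the per-host delay is rarely what holds you back; the total number of fetcher threads and the backend are.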
With ES as a backend, the bottlenecks tend to be the query times in the spouts and the merging of segments. As your crawl grows, both take longer and longer. There are various ways to optimise things, e.g. using sampling with the AggregationSpouts. Giving plenty of RAM to ES would help, and so would using SSDs. You could tweak the various params, but to be honest, 1M pages per hour on a single server sounds very ambitious with ES as a backend: the faster you crawl, the more URLs you discover and the bigger your index becomes.
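If you want to experiment, below is a sketch of the spout-related settings in the ES configuration, assuming the usual es-conf.yaml keys; treat the values as starting points to tune against your own query times rather than a recipe:

  # ES status spout settings (illustrative values)
  es.status.sample: true             # use a sampler aggregation to reduce query cost
  es.status.max.buckets: 50          # host buckets returned per query
  es.status.max.urls.per.bucket: 10  # URLs taken from each bucket per query
  es.status.bucket.field: "key"      # field used for bucketing (host or domain key)
  spout.min.delay.queries: 2000      # min delay in ms between queries to the status index
  # plus: give the ES JVM a generous heap (-Xms/-Xmx in jvm.options) and keep the index on SSDs

Even with sampling, merges on the status index will dominate as the crawl grows, so keep an eye on segment counts and merge times.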
Do you plan to revisit URLs at all or is it a one-off crawl?
Could you please get in touch by email? I'd like to discuss this, as it pertains to some work I am doing at the moment (and I am always curious about what people do with SC). Thanks.