How does StormCrawler identify seed URLs?

I am using StormCrawler with MySQL.

I have 100 seed URLs, but my buffer size is only 50.

What happens if the outlinks from some seeds fall into bucket number zero? In that case, will those outlinks also be treated as seeds?

How does StormCrawler differentiate seed URLs from other URLs?

Solution

Not sure I understand your question. There is no difference between seed URLs and non-seed ones. StormCrawler does not identify them in any particular way. The term seed URLs simply means that they are given to the crawler as a starting point.

The buckets are not used to prioritise the URLs or to distinguish between them. They are based on the hostname or domain, so that multiple spout instances can read them in parallel while guaranteeing a good diversity of sites for performance purposes.
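To illustrate the idea, here is a minimal sketch of host-based bucketing. This is not StormCrawler's actual code: the class name BucketSketch, the method bucketFor and the hash-modulo scheme are assumptions made up for the example. It only shows why a seed and its outlinks on the same host end up in the same bucket, whichever of them was discovered first.

```java
import java.net.URI;
import java.util.Locale;

// Illustrative only: a hypothetical helper, not part of StormCrawler.
final class BucketSketch {

    // Maps a URL to a bucket using only its hostname, so every URL
    // from the same host lands in the same bucket.
    static int bucketFor(String url, int numBuckets) {
        String host = URI.create(url).getHost();
        if (host == null) {
            host = url; // fall back to the raw string when no host can be parsed
        }
        // floorMod keeps the result in [0, numBuckets) even for negative hash codes
        return Math.floorMod(host.toLowerCase(Locale.ROOT).hashCode(), numBuckets);
    }
}
```

With this sketch, bucketFor("https://example.com/", 10) and bucketFor("https://example.com/about", 10) return the same value, because only the hostname is hashed. Nothing in the computation records whether a URL was a seed.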

The SQL module in StormCrawler is not as efficient as other backends such as the Solr or Elasticsearch ones. It works fine with a few websites but is probably less efficient beyond that.