
How to seed URLs as a text file in StormCrawler?


I have numerous URLs (about 40,000) that need to be crawled with StormCrawler. Is there any way to pass these URLs as a text file instead of listing them in crawler.flux? Something like this:

spouts:
  - id: "spout"
    className: "com.digitalpebble.stormcrawler.spout.MemorySpout"
    parallelism: 1
    constructorArgs:
      - "URLs.txt"

Solution

  • There is a `FileSpout` (`com.digitalpebble.stormcrawler.spout.FileSpout`) exactly for that purpose. It reads seed URLs from text files, one URL per line. It is used by the example topologies mentioned by @sebastian-nagel, and you can use it in your own topology just as well.
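
As a sketch, the spout section of your Flux file could look like the following. This assumes the common `FileSpout` constructor taking a directory, a file-name pattern, and a boolean indicating whether discovered URLs get the DISCOVERED status; the directory `"."` and file name `"URLs.txt"` are placeholders for your setup:

```yaml
spouts:
  - id: "spout"
    className: "com.digitalpebble.stormcrawler.spout.FileSpout"
    parallelism: 1
    constructorArgs:
      - "."          # directory containing the seed file(s)
      - "URLs.txt"   # file name (or pattern) to read seeds from
      - true         # mark the URLs with the DISCOVERED status
```

Each line of the seed file is expected to hold one URL, optionally followed by tab-separated key=value metadata pairs.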