postgresql, web-crawler, stormcrawler

How to set up StormCrawler with Postgres?


I'm trying to set up StormCrawler with a PostgreSQL database as the backend, but there is no documentation on which tables need to exist before the crawler can start.

Which tables do I need, and which columns do they have? Or is there some way to create the required tables automatically? Also, how do I start the crawler in this mode? I cannot pass in a seed URL as in the example crawler topology.


Solution

  • See tableCreationScript for the schema. To inject URLs, you can either add them to the table yourself with an INSERT, as shown in this tutorial, or reuse the injection topology from the Elasticsearch module and specify the StatusUpdaterBolt from the MySQL module instead. Another approach is simply to add a MemorySpout to the topology alongside the SQLSpout.
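
The first option above (inserting seeds into the table by hand) could look roughly like the sketch below. It is only an illustration under stated assumptions: the `urls` table and its columns are a guess modelled on what a status table for the crawler typically holds, not a copy of tableCreationScript, so check the real script before relying on any name or type. The helper `inject_seeds` is hypothetical, and an in-memory SQLite database stands in for Postgres so the example runs without a server.

```python
import sqlite3
from urllib.parse import urlparse

# ASSUMPTION: table name, column names and types are illustrative only.
# The authoritative definition is the module's table creation script.
CREATE_URLS = """
CREATE TABLE IF NOT EXISTS urls (
    url           VARCHAR(255) PRIMARY KEY,
    status        VARCHAR(16)  DEFAULT 'DISCOVERED',
    nextfetchdate TIMESTAMP    DEFAULT CURRENT_TIMESTAMP,
    metadata      TEXT,
    bucket        SMALLINT     DEFAULT 0,
    host          VARCHAR(128)
)
"""


def inject_seeds(conn, seeds):
    """Insert seed URLs, leaving status and fetch date to the column defaults.

    Hypothetical helper: URLs already in the table are skipped.
    Returns the total number of rows in the table afterwards.
    """
    conn.execute(CREATE_URLS)
    rows = [(url, urlparse(url).hostname) for url in seeds]
    # INSERT OR IGNORE is SQLite syntax; on Postgres the equivalent is
    # INSERT ... ON CONFLICT (url) DO NOTHING, with %s placeholders in psycopg2.
    conn.executemany("INSERT OR IGNORE INTO urls (url, host) VALUES (?, ?)", rows)
    conn.commit()
    return conn.execute("SELECT COUNT(*) FROM urls").fetchone()[0]


# SQLite stands in for Postgres here so the sketch is self-contained.
conn = sqlite3.connect(":memory:")
count = inject_seeds(conn, ["https://example.com/", "https://example.org/a"])
```

Against a real Postgres instance you would open the connection with a Postgres driver and adapt the placeholder and conflict syntax as noted in the comments; the spout can then pick up the rows whose fetch date is due.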