
What database does Apache Nutch use for storing URLs?


I looked through its dependencies but can't figure out what it uses to store URLs and track the progress of the crawl. Judging by the tutorial requirements, it doesn't need any third-party system such as an SQL database.

So what does it use?

Thanks for any suggestions!


Solution

  • Nutch 1.x stores its data in Hadoop MapFiles and SequenceFiles. Apache Nutch is a batch-based crawler, and the data is either

    • write-once/read-many, as with the segments created and filled in every crawl cycle, or
    • rewritten whenever new data is added, as with the "CrawlDb", which holds the URLs and their status information (fetch status and date, signature/checksum, score, metadata).

    Nutch 2.x (retired) put all data into a single "web table", delegating scale-up and distribution to big data stores (HBase, etc.) via Apache Gora.
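To make the two access patterns in Nutch 1.x more concrete, here is a toy model in Python. It is only a sketch of the idea, not Nutch code: the names `CrawlDatum`, `generate`, `fetch`, and `update_db` are hypothetical stand-ins chosen for illustration, the "web" is a fake in-memory link graph, and plain dicts and tuples take the place of the Hadoop MapFiles and SequenceFiles. Each cycle writes one segment that is never modified afterwards, while the CrawlDb is rebuilt as a new version from the old one plus that segment.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class CrawlDatum:
    """Toy stand-in for the per-URL record kept in the CrawlDb."""
    status: str      # simplified: "unfetched" or "fetched"
    fetch_time: int  # cycle number of the last fetch, -1 if never fetched


# A segment is modelled as an immutable tuple of (url, outlinks):
# written once per cycle, then only read.
Segment = Tuple[Tuple[str, Tuple[str, ...]], ...]


def generate(crawldb: Dict[str, CrawlDatum], limit: int) -> List[str]:
    """Select up to `limit` unfetched URLs as the next fetch list."""
    due = sorted(u for u, d in crawldb.items() if d.status == "unfetched")
    return due[:limit]


def fetch(urls: List[str], web: Dict[str, List[str]]) -> Segment:
    """Pretend to fetch each URL; `web` is a fake link graph, not HTTP."""
    return tuple((u, tuple(web.get(u, []))) for u in urls)


def update_db(crawldb: Dict[str, CrawlDatum], segment: Segment,
              cycle: int) -> Dict[str, CrawlDatum]:
    """Build a NEW CrawlDb from the old one plus a segment (no in-place edit)."""
    new_db = dict(crawldb)
    for url, outlinks in segment:
        new_db[url] = CrawlDatum("fetched", cycle)
        for link in outlinks:
            # Newly discovered URLs enter the CrawlDb as unfetched.
            new_db.setdefault(link, CrawlDatum("unfetched", -1))
    return new_db


def run_cycles(seeds: List[str], web: Dict[str, List[str]], cycles: int):
    """Run a few batch cycles; returns the final CrawlDb and all segments."""
    crawldb = {u: CrawlDatum("unfetched", -1) for u in seeds}
    segments: List[Segment] = []
    for cycle in range(cycles):
        segment = fetch(generate(crawldb, limit=10), web)
        segments.append(segment)                       # write-once
        crawldb = update_db(crawldb, segment, cycle)   # rewritten
    return crawldb, segments
```

Calling `run_cycles(["http://a.example/"], web, 2)` with a small `web` dict shows the pattern: the list of segments only grows, and the CrawlDb is replaced by a new version after every cycle.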