
How to save fetched HTML content to a database in Apache Nutch?


I'm using Apache Nutch 1.8. I want to save the crawled HTML content to a PostgreSQL database. To do this, I modified the FetcherThread.java class as shown below.

  case ProtocolStatus.SUCCESS: // got a page
    pstatus = output(fit.url, fit.datum, content, status,
        CrawlDatum.STATUS_FETCH_SUCCESS, fit.outlinkDepth);
    updateStatus(content.getContent().length);
    /* Added my code here */

But I want to use the plug-in system instead of modifying the FetcherThread class directly. Which extension points do I need to use for this?


Solution

  • You could write a custom plugin that implements org.apache.nutch.indexer.IndexWriter and sends the documents to Postgres as part of the indexing step. You'll need to index the raw content, which requires NUTCH-2032. That change is in Nutch 1.11, so you will need to upgrade your version of Nutch.

    Alternatively, you could write a custom MapReduce job that takes segments as input, reads the content, and sends it to your DB in the reduce step.
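
    To make the first option more concrete, here is a minimal sketch of the buffering and batched insert such a plugin might do. It is not Nutch code: PageRow, BatchingPageWriter, jdbcSink and the pages(url, html) table are all hypothetical names chosen for illustration. The write/commit/close methods only loosely mirror the methods of the same names on IndexWriter, so check the interface in your Nutch version for the exact signatures. The upsert uses ON CONFLICT, which assumes PostgreSQL 9.5 or later.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical holder for one fetched page; not a Nutch class. */
final class PageRow {
    final String url;
    final String html;

    PageRow(String url, String html) {
        this.url = url;
        this.html = html;
    }
}

/**
 * Buffers pages and hands them to a sink in batches. A real plugin would
 * call write() from its IndexWriter.write(...) implementation after pulling
 * the URL and raw content out of the document (assumption).
 */
final class BatchingPageWriter {
    // Assumed table: pages(url TEXT PRIMARY KEY, html TEXT).
    static final String UPSERT_SQL =
        "INSERT INTO pages (url, html) VALUES (?, ?) "
        + "ON CONFLICT (url) DO UPDATE SET html = EXCLUDED.html";

    private final int batchSize;
    private final Consumer<List<PageRow>> sink;
    private final List<PageRow> buffer = new ArrayList<>();

    BatchingPageWriter(int batchSize, Consumer<List<PageRow>> sink) {
        if (batchSize < 1) {
            throw new IllegalArgumentException("batchSize must be >= 1");
        }
        this.batchSize = batchSize;
        this.sink = sink;
    }

    /** Queues one page and flushes once the buffer reaches batchSize. */
    void write(String url, String html) {
        buffer.add(new PageRow(url, html));
        if (buffer.size() >= batchSize) {
            commit();
        }
    }

    /** Sends any buffered pages to the sink and clears the buffer. */
    void commit() {
        if (buffer.isEmpty()) {
            return;
        }
        sink.accept(new ArrayList<>(buffer));
        buffer.clear();
    }

    /** Flushes whatever is left; call once at the end of the job. */
    void close() {
        commit();
    }

    /** Sink that writes each batch through JDBC using the upsert above. */
    static Consumer<List<PageRow>> jdbcSink(Connection conn) {
        return rows -> {
            try (PreparedStatement ps = conn.prepareStatement(UPSERT_SQL)) {
                for (PageRow row : rows) {
                    ps.setString(1, row.url);
                    ps.setString(2, row.html);
                    ps.addBatch();
                }
                ps.executeBatch();
            } catch (SQLException e) {
                throw new IllegalStateException("batch insert failed", e);
            }
        };
    }
}
```

    With a JDBC connection to your database, the writer could be created as new BatchingPageWriter(500, BatchingPageWriter.jdbcSink(conn)), where 500 is an arbitrary batch size. Passing the sink in as a Consumer keeps the buffering logic testable without a running Postgres instance, and the same class could be reused in the reduce step of the MapReduce alternative.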