I'm using Apache Nutch 1.8 and want to save the crawled HTML content to a PostgreSQL database. To do this, I modified the FetcherThread.java class as shown below.
    case ProtocolStatus.SUCCESS: // got a page
      pstatus = output(fit.url, fit.datum, content, status,
          CrawlDatum.STATUS_FETCH_SUCCESS, fit.outlinkDepth);
      updateStatus(content.getContent().length);
      /* Added my code here */
However, I would rather use the plugin system than modify the FetcherThread class directly. Which extension points do I need to use for this?
You could write a custom plugin that implements org.apache.nutch.indexer.IndexWriter and sends the documents to Postgres as part of the indexing step. You will need to index the raw content, which requires NUTCH-2032. That change is in Nutch 1.11, so you will need to upgrade your version of Nutch.
Alternatively, you could write a custom MapReduce job that takes segments as input, reads the content, and sends it to your DB in the reduce step.
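
Whichever route you pick, the last step is the same: hand a URL and its raw HTML to Postgres. Here is a rough sketch of just that part, using plain JDBC and no Nutch classes. Everything named here is my own assumption rather than anything from Nutch: the `pages` table, the `PostgresPageWriter` class, and the `BatchSink` interface are hypothetical, and the `ON CONFLICT` upsert assumes PostgreSQL 9.5 or later with the Postgres JDBC driver on the classpath.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

/** Buffers (url, html) rows and hands them to a sink in batches. Hypothetical helper. */
class PostgresPageWriter implements AutoCloseable {

    // Assumed table: CREATE TABLE pages (url TEXT PRIMARY KEY, html TEXT);
    // ON CONFLICT needs PostgreSQL 9.5+; re-crawled URLs overwrite the old HTML.
    static final String UPSERT_SQL =
        "INSERT INTO pages (url, html) VALUES (?, ?) "
      + "ON CONFLICT (url) DO UPDATE SET html = EXCLUDED.html";

    /** Receives one full batch of {url, html} pairs. */
    interface BatchSink {
        void flush(List<String[]> rows) throws SQLException;
    }

    private final BatchSink sink;
    private final int batchSize;
    private final List<String[]> buffer = new ArrayList<>();

    PostgresPageWriter(BatchSink sink, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        this.sink = sink;
        this.batchSize = batchSize;
    }

    /** Builds a writer whose sink opens a JDBC connection for each batch (simple, not the fastest). */
    static PostgresPageWriter forJdbc(String jdbcUrl, String user, String password, int batchSize) {
        return new PostgresPageWriter(rows -> {
            try (Connection conn = DriverManager.getConnection(jdbcUrl, user, password);
                 PreparedStatement ps = conn.prepareStatement(UPSERT_SQL)) {
                for (String[] row : rows) {
                    ps.setString(1, row[0]); // url
                    ps.setString(2, row[1]); // html
                    ps.addBatch();
                }
                ps.executeBatch();
            }
        }, batchSize);
    }

    /** Queues one page; flushes automatically once the buffer reaches batchSize. */
    void write(String url, String html) throws SQLException {
        buffer.add(new String[] {url, html});
        if (buffer.size() >= batchSize) {
            commit();
        }
    }

    /** Sends any buffered rows to the sink. */
    void commit() throws SQLException {
        if (buffer.isEmpty()) {
            return;
        }
        sink.flush(new ArrayList<>(buffer));
        buffer.clear();
    }

    /** Flushes whatever is left in the buffer. */
    @Override
    public void close() throws SQLException {
        commit();
    }
}
```

In an IndexWriter plugin you would create the writer when the index writer is opened, call `write(url, html)` for each document, and call `commit()` and `close()` from the corresponding IndexWriter methods. In the MapReduce variant the reducer would do the same thing in its reduce and cleanup steps. Which document field actually carries the raw content depends on your Nutch version and configuration, so I have left that part out.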