Tags: web-crawler, stormcrawler

Deleting fetched records automatically when FETCH_ERROR occurs


I'm working with StormCrawler 1.13 and ran the crawler successfully on a website. One of the pages was later deleted from the site, and on the next revisit (per the crawler-conf schedule) the status index was updated to FETCH_ERROR for the missing URL. However, when I check the main index, the record for that URL is still there. How can I delete that record automatically whenever a FETCH_ERROR appears?


Solution

  • The FETCH_ERROR status gets converted into an ERROR after a number of successive failed attempts (set by fetch.error.count). Once that happens, a tuple is sent on the deletion stream by the AbstractStatusUpdaterBolt, and if you have a DeletionBolt connected to that stream (as sketched below), the URL will be removed from the Elasticsearch content index. It will remain in the status index, though, and will be revisited or not depending on the scheduling for ERRORs.
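
Here is a minimal sketch of the relevant topology wiring in Java. It assumes the Elasticsearch StatusUpdaterBolt is registered under the component id "status" and the fetcher under "fetch"; the component ids and the surrounding topology (spout, parser, indexer, omitted here) are illustrative, not prescribed by StormCrawler.

```java
import org.apache.storm.topology.TopologyBuilder;

import com.digitalpebble.stormcrawler.Constants;
import com.digitalpebble.stormcrawler.elasticsearch.bolt.DeletionBolt;
import com.digitalpebble.stormcrawler.elasticsearch.persistence.StatusUpdaterBolt;

public class DeletionWiringSketch {
    public static void main(String[] args) {
        TopologyBuilder builder = new TopologyBuilder();

        // ... spout, fetcher ("fetch"), parser and indexer bolts would be
        // declared here as in a standard ES topology (omitted for brevity) ...

        // Status updater: persists URL statuses to the ES status index.
        // After fetch.error.count successive FETCH_ERRORs it converts the
        // status into ERROR and emits the URL on the deletion stream.
        builder.setBolt("status", new StatusUpdaterBolt())
                .localOrShuffleGrouping("fetch", Constants.StatusStreamName);

        // DeletionBolt: subscribes to the deletion stream and removes the
        // matching document from the ES content index.
        builder.setBolt("deleter", new DeletionBolt())
                .localOrShuffleGrouping("status", Constants.DELETION_STREAM_NAME);
    }
}
```

With this wiring in place, once the missing page's status reaches ERROR, its document is deleted from the content index automatically; nothing extra is needed in crawler-conf beyond tuning fetch.error.count if the default number of retries doesn't suit you.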