
What happens when a previously "FETCHED" url is removed on the web server side and StormCrawler goes to it again?


We have lots of sites being updated, added, and deleted. I'm curious how StormCrawler handles a URL that was previously "FETCHED" when, on the next crawl, the page has been removed and now returns a redirect or a 404. What happens to the content from the old version of the page in the "Index" index?

I know the entry in the "Status" index probably changes to "REDIRECTION" or "FETCH_ERROR" or something similar, but what about the content itself? Is it deleted, or left in place? I'm trying to understand how SC behaves here and whether I need to clean up these orphaned documents in the "Index" index myself.

I would expect SC to delete the content if it's no longer there, but I thought I would ask to be sure.


Solution

  • As you pointed out, a missing URL will get a FETCH_ERROR status, which, after being retried a number of times (controlled by the parameter max.fetch.errors, default 3), will turn into an ERROR status.

    The content will be deleted from the index if you connect a DeletionBolt to the status updater; see the example topology.
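The retry threshold mentioned above lives in the crawler configuration. A minimal sketch of the relevant entry in a `crawler-conf.yaml` (the key name is real; the surrounding values are illustrative and depend on your setup):

```yaml
config:
  # Number of times a FETCH_ERROR is retried before the URL
  # is marked as ERROR (default is 3)
  max.fetch.errors: 3
```

Once a URL reaches ERROR status it is no longer scheduled for fetching, which is the point at which the deletion described below can be triggered.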
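The wiring itself can be sketched as below. This is an assumption-laden fragment, not a complete topology: it follows the pattern of the Elasticsearch example topology, where the status updater emits deletions on a dedicated stream that a DeletionBolt subscribes to. Exact package names and class locations vary across StormCrawler versions, so check them against the release you use.

```java
// Sketch: connect a DeletionBolt to the status updater's deletion stream.
// Package names follow older StormCrawler ES-module releases; verify for your version.
import org.apache.storm.topology.TopologyBuilder;
import com.digitalpebble.stormcrawler.Constants;
import com.digitalpebble.stormcrawler.elasticsearch.bolt.DeletionBolt;
import com.digitalpebble.stormcrawler.elasticsearch.persistence.StatusUpdaterBolt;

public class DeletionWiringSketch {
    public static void wire(TopologyBuilder builder) {
        // "status" writes status changes (FETCHED, ERROR, ...) to the status index
        builder.setBolt("status", new StatusUpdaterBolt())
               .localOrShuffleGrouping("fetch", Constants.StatusStreamName);

        // The deleter listens on the deletion stream emitted by the status
        // updater and removes the corresponding document from the "Index" index
        builder.setBolt("deleter", new DeletionBolt())
               .localOrShuffleGrouping("status", Constants.DELETION_STREAM_NAME);
    }
}
```

Without the second `setBolt` call, the status index is updated but the old content stays behind in the content index, which is exactly the orphaned-document situation the question describes.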