We have lots of sites being updated, added, and deleted. I'm curious how StormCrawler handles a URL that was previously "FETCHED" but, the next time SC reaches it, has been removed and returns either a redirect or a 404. What happens to the content from the old version of the page in the "Index" index?
I know the URL in the "Status" index probably changes to "REDIRECTION" or "FETCH_ERROR" or something, but what about the content itself? Is it deleted? Is it left? I am trying to figure out how SC reacts here and whether I have to clean up these orphaned docs in the "Index" index myself.
I would expect SC to delete the content if it's no longer there, but I thought I would ask to be sure.
As you pointed out, a missing URL will get a FETCH_ERROR status. Once the fetch has failed the number of times set by the max.fetch.errors parameter (default 3), the status turns into ERROR.
The content will be deleted if you connect a DeletionBolt to the status updater; see the example topology.
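
To make the sequence concrete, here is a toy Java model of the behaviour described above. It is not StormCrawler code: the class StatusModel, its update method, the per-URL counter, and the reset of that counter on a successful fetch are all assumptions made for illustration. Only the FETCH_ERROR and ERROR statuses and the max.fetch.errors default of 3 come from the answer.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model only: a hypothetical stand-in for the status updater, not the real class.
public class StatusModel {
    enum Status { FETCHED, FETCH_ERROR, ERROR }

    // Mirrors the default of max.fetch.errors mentioned above.
    static final int MAX_FETCH_ERRORS = 3;

    // Consecutive fetch errors seen per URL (hypothetical bookkeeping).
    private final Map<String, Integer> errorCounts = new HashMap<>();

    // URLs that would be handed to a deletion step such as a DeletionBolt.
    final List<String> deletions = new ArrayList<>();

    // Returns the status that would be stored for the URL after this report.
    Status update(String url, Status reported) {
        Status result = reported;
        if (reported == Status.FETCH_ERROR) {
            int count = errorCounts.merge(url, 1, Integer::sum);
            if (count >= MAX_FETCH_ERRORS) {
                result = Status.ERROR; // too many failures: give up on the URL
            }
        } else if (reported == Status.FETCHED) {
            errorCounts.remove(url); // assumption: a success resets the counter
        }
        if (result == Status.ERROR) {
            errorCounts.remove(url);
            deletions.add(url); // only now is the indexed document a deletion candidate
        }
        return result;
    }

    public static void main(String[] args) {
        StatusModel model = new StatusModel();
        String url = "http://example.com/removed-page";
        for (int attempt = 1; attempt <= MAX_FETCH_ERRORS; attempt++) {
            System.out.println("attempt " + attempt + ": " + model.update(url, Status.FETCH_ERROR));
        }
        System.out.println("to delete: " + model.deletions);
    }
}
```

The point of the sketch is that, in this model, the old document stays in the "Index" index while the URL is still in FETCH_ERROR, and it only becomes a candidate for deletion once the status reaches ERROR and something is listening for that event.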