So I'm trying to turn on the DeletionBolt on my StormCrawler instances so they can clean up the indexes as the URLs for our sites change and pages go away.
For reference, I am on StormCrawler 1.13 (our systems people have not upgraded us to Elasticsearch 7 yet).
Having never modified the es-crawler.flux before, I'm looking for some help to confirm whether I am doing this correctly.
I added a bolt:
- id: "deleter"
className: "com.digitalpebble.stormcrawler.elasticsearch.bolt.DeletionBolt"
parallelism: 1
and then added the stream:
- from: "status"
to: "deleter"
grouping:
type: FIELDS
args: ["url"]
streamId: "deletion"
Is that the correct way to do this? I don't want to accidentally delete everything in my index by putting in the wrong info. 🤣
Yes, to answer my own question: adding the two items above to their respective places in the es-crawler.flux DOES in fact cause the crawler to delete docs.
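For anyone else trying this, here is roughly how the two additions sit in the es-crawler.flux. The "status" bolt and its StatusUpdaterBolt class are the stock ones from the 1.13 ES archetype, so treat this as a sketch of the relevant sections rather than a verbatim copy:

bolts:
  # ... existing bolts (fetcher, parser, indexer, etc.) ...
  - id: "status"
    className: "com.digitalpebble.stormcrawler.elasticsearch.persistence.StatusUpdaterBolt"
    parallelism: 1
  # new: deletes docs from the content index when their URL is gone
  - id: "deleter"
    className: "com.digitalpebble.stormcrawler.elasticsearch.bolt.DeletionBolt"
    parallelism: 1

streams:
  # ... existing streams ...
  # new: route tuples emitted on the "deletion" stream to the DeletionBolt
  - from: "status"
    to: "deleter"
    grouping:
      type: FIELDS
      args: ["url"]
      streamId: "deletion"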
To test this, I created a directory on one of our servers with a few files in it: index.html, test1.html, test2.html, and test3.html. index.html had links to the three test HTML files. I crawled them, having first limited the crawler to ONLY that specific directory, and modified the fetch settings to re-crawl fetched docs after 3 minutes and re-crawl fetch-error docs after 5 minutes.
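For reference, the re-crawl timing came from the fetch interval settings in crawler-conf.yaml; it was something along these lines (values are in minutes, and the restriction to the test directory was done separately with a URL filter, not shown here):

  # re-crawl a successfully fetched doc after 3 minutes (stock default is 1440)
  fetchInterval.default: 3
  # re-crawl a doc that had a fetch error after 5 minutes (stock default is 120)
  fetchInterval.fetch.error: 5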
All 4 docs showed up in the status index as FETCHED, and their content showed up in the content index.
I then renamed test3.html to test4.html and changed the link in index.html. The crawler picked up the change, set the status of test3.html to FETCH_ERROR, and added test4.html to the indexes. After 5 minutes it crawled test3.html again, keeping the FETCH_ERROR status. After another 5 minutes it crawled it again, changed the status to ERROR, and deleted the test3.html doc from the content index.
So that worked great. In our production indexes, we have a bunch of docs that have gone from FETCH_ERROR status to ERROR status, but because deletions were not enabled, the actual content was not deleted and is still showing up in searches. On my test pages, here's the solution to that:
I disabled deletions (removing the two items above from the es-crawler.flux), renamed test2.html to test5.html, and modified the link in index.html. The crawler went through the three crawls with FETCH_ERROR and then set the status to ERROR, but did not delete the doc from the content index.
I re-enabled deletions and let the crawler run for a while, but soon realized that when the crawler set the status to ERROR, it also set the nextFetchDate to 12/31/2099.
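If I understand the scheduler correctly, that far-future placeholder shows up because the stock config never revisits a page once it reaches ERROR, i.e. this default in crawler-conf.yaml:

  # never re-fetch a page whose status is ERROR (value in minutes, -1 = never)
  fetchInterval.error: -1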
So I went into the Elasticsearch index and ran the following query to reset the status and to set the nextFetchDate to a time just ahead of the current date/time:
POST /www-test-status/_update_by_query
{
  "script": {
    "source": """
      if (ctx._source.status != null) {
        // clear the stored error cause (kept under the metadata object), then
        // reset status and nextFetchDate so the doc gets scheduled again
        if (ctx._source.metadata != null) {
          ctx._source.metadata.remove('error%2Ecause');
        }
        ctx._source.status = 'FETCH_ERROR';
        ctx._source.nextFetchDate = '2019-10-09T15:01:33.000Z';
      }
    """,
    "lang": "painless"
  },
  "query": {
    "match": {
      "status": "ERROR"
    }
  }
}
The crawler then picked up the docs the next time it came around and, when they went back to ERROR status, deleted them from the content index.
Not sure if that's the completely proper way to do it, but it has worked for me.