Tags: web-crawler, stormcrawler

Archiving old websites with StormCrawler and Elasticsearch


When StormCrawler re-visits a website that has already been fetched, it updates the corresponding document in the Elasticsearch index, i.e. the old content is overwritten by the new one.

Is there any StormCrawler functionality that allows us to keep the old version of certain fields and annotate it with a timestamp?

We looked into the Elasticsearch rollover API and ingest pipelines. Ingest pipelines look promising for modifying Elasticsearch documents on update operations. Is there a way to append the pipeline parameter (i.e., ?pipeline=xxx) to the relevant Elasticsearch requests via the StormCrawler configuration?
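
To make the question concrete, the snippet below shows roughly the kind of pipeline we have in mind, created through the Elasticsearch low-level Java REST client. The pipeline name add-fetch-date and the field last_fetch_date are placeholders, and note that a set processor like this only stamps the incoming document; it does not by itself preserve the previous field values.

```java
import org.apache.http.HttpHost;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.RestClient;

public class CreateTimestampPipeline {
    public static void main(String[] args) throws Exception {
        try (RestClient client = RestClient.builder(
                new HttpHost("localhost", 9200, "http")).build()) {
            // Placeholder pipeline that stamps every incoming document with
            // the ingest time; it does not keep the previous field values.
            Request request = new Request("PUT", "/_ingest/pipeline/add-fetch-date");
            request.setJsonEntity(
                "{"
                + "\"description\": \"stamp documents with the ingest time\","
                + "\"processors\": ["
                + "  {\"set\": {\"field\": \"last_fetch_date\","
                + "             \"value\": \"{{_ingest.timestamp}}\"}}"
                + "]}");
            client.performRequest(request);
        }
    }
}
```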


Solution

  • One option could be to use the URL + timestamp as the key and store each version of the document separately; you'd have to deduplicate at search time, though (see the first sketch below). This would need a minor change to the code.

    We can't currently append parameters via the config, but it should be doable. I have never used pipelines in ES; can't they be configured to be used by default on a particular index? (A sketch of that approach follows below as well.)
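
A minimal sketch of the URL + timestamp keying idea, assuming the document ID is derived from a SHA-256 hash of the URL concatenated with the fetch time. The helper below is hypothetical and not part of StormCrawler; searches would still need to deduplicate, e.g. by collapsing on the URL field.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.time.Instant;

public class VersionedDocId {

    // Derive a per-version document ID from the URL plus the fetch time,
    // so each crawl writes a new document instead of updating the old one.
    public static String docId(String url, Instant fetchTime) throws Exception {
        String key = url + "|" + fetchTime.toEpochMilli();
        MessageDigest sha256 = MessageDigest.getInstance("SHA-256");
        byte[] hash = sha256.digest(key.getBytes(StandardCharsets.UTF_8));
        StringBuilder hex = new StringBuilder(hash.length * 2);
        for (byte b : hash) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    public static void main(String[] args) throws Exception {
        // Each fetch of the same URL yields a distinct document ID.
        System.out.println(docId("https://example.com/page", Instant.now()));
    }
}
```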
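
Regarding default pipelines: recent Elasticsearch versions let an index declare one via the index.default_pipeline setting, which is then applied to every index request that does not name a pipeline itself, so the StormCrawler requests would not need to change. Below is a sketch using the low-level Java REST client, with content and add-fetch-date as placeholder index and pipeline names.

```java
import org.apache.http.HttpHost;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.RestClient;

public class SetDefaultPipeline {
    public static void main(String[] args) throws Exception {
        try (RestClient client = RestClient.builder(
                new HttpHost("localhost", 9200, "http")).build()) {
            // Make the (placeholder) add-fetch-date pipeline the default for
            // all index requests hitting the (placeholder) content index.
            Request request = new Request("PUT", "/content/_settings");
            request.setJsonEntity(
                "{\"index.default_pipeline\": \"add-fetch-date\"}");
            client.performRequest(request);
        }
    }
}
```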