What are the implications of not tracking the url.path in StormCrawler?


We're using StormCrawler and storing our status index in Elasticsearch. This index is getting pretty big (almost 3 billion docs!), so the shards are also large and cumbersome to back up.

I'm considering removing the url.path metadata array element in the docs. It looks like I can disable it with metadata.track.path.

What are the implications if I were to no longer index this and delete what I have?


Solution

  • If you are not interested in tracking how a particular URL was found, then yes, you'd save space (and a bit of time) by setting metadata.track.path to false. You can do that straight away, and any new documents won't have the corresponding field.

    I'm not sure what you mean by 'delete what I have': you can't delete just one field from existing documents, you'd have to delete and reindex the whole documents.

    As a rule, make sure you index only the fields you need. See this customised version of the ES index init script where 'hostname' has been moved out of the fields prefixed with metadata in order to be searchable. The options available depend on the version of Elasticsearch that you are using.
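
    For reference, a minimal sketch of the setting discussed above. It assumes the usual StormCrawler layout, where options sit under a top-level config key in the crawler configuration file (the file name crawler-conf.yaml is an assumption here, so adapt it to your own topology):

    ```yaml
    # crawler-conf.yaml (assumed file name)
    config:
      # Stop recording the chain of URLs that led to each page (url.path).
      # Only documents written after this change are affected; existing
      # documents in the status index keep the field.
      metadata.track.path: false
    ```

    The topology needs to be restarted with the updated configuration for the change to take effect.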