StormCrawler - how does es.status.filterQuery work?


I am using StormCrawler to put data into some Elasticsearch indexes, and I have a bunch of URLs in the status index with a variety of statuses: DISCOVERED, FETCHED, ERROR, etc.

I was wondering whether I could tell StormCrawler to crawl only the URLs that use https and have the status DISCOVERED, and whether that would actually work. I have set the following in es-conf.yaml:

es.status.filterQuery: "-(url:https* AND status:DISCOVERED)"

Is that correct? How does SC make use of es.status.filterQuery? Does it run a search and apply the value as a filter to retrieve only the applicable documents to fetch?


Solution

  • See the code of the AggregationSpout.

    How does SC make use of es.status.filterQuery? Does it run a search and apply the value as a filter to retrieve only the applicable documents to fetch?

    Yes, it filters the queries sent to the ES shards. This is useful, for instance, when you want to process only a subset of a crawl.

    It is a positive filter, i.e. a document must match the query in order to be retrieved. The leading "-" in your value negates the query, so you would need to remove it for the filter to do what you described.
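
    As a minimal sketch of that fix, the setting from the question with the negation removed might look like the fragment below. It assumes the value is interpreted with Elasticsearch query string syntax and that url and status are the field names in the status index, as in the question. Whether the https* wildcard matches as intended depends on how the url field is mapped, so it is worth checking the query against the index directly first.

    ```yaml
    # es-conf.yaml (fragment; key shown as written in the question)
    # Positive filter: only status documents matching this query are retrieved.
    # Assumption: "url" and "status" are the field names in the status index.
    es.status.filterQuery: "url:https* AND status:DISCOVERED"
    ```

    With the original leading "-", the same expression would instead select every document that is not both https and DISCOVERED.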