I am using StormCrawler to put data into some Elasticsearch indexes, and I have a bunch of URLs in the status index with a variety of statuses: DISCOVERED, FETCHED, ERROR, etc.
I was wondering if I could tell StormCrawler to crawl only the URLs that are https and have the status DISCOVERED, and whether that would actually work. I have es-conf.yaml set as follows:
es.status.filterQuery: "-(url:https* AND status:DISCOVERED)"
Is that correct? How does SC make use of the es.status.filterQuery? Does it run a search and apply the value as a filter to retrieve only the applicable documents to fetch?
See the code of the AggregationSpout.
How does SC make use of the es.status.filterQuery? Does it run a search and apply the value as a filter to retrieve only the applicable documents to fetch?
Yes, it filters the queries sent to the ES shards. This is useful, for instance, to process a subset of a crawl.
It is a positive filter, i.e. the documents must match the query in order to be retrieved; you'd need to remove the leading - for it to do what you described.
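As a sketch of the change described above, the es-conf.yaml entry from the question with the leading - removed would look like the following. This assumes the filter is interpreted with Elasticsearch query string syntax and that the url and status fields are indexed so that a prefix wildcard and an exact term match work on them; check both against your own status index mapping.

```yaml
# Positive filter: only documents matching this query are retrieved by the spout.
# Assumes query string syntax and the field names used in the question.
es.status.filterQuery: "url:https* AND status:DISCOVERED"
```

With the original leading -, the query is negated, so the spout would retrieve everything except the https URLs with status DISCOVERED.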