
Redirections handling in Storm-Crawler


With StormCrawler (SC), should I be able to follow redirections without emitting outlinks? Should the redirected URL be injected into my backend as "DISCOVERED" or not? My small experiments with the following setup suggest not:

crawler.yaml:        redirections.allowed: true
                     parser.emitOutlinks: false
urlfilters.json:     "maxDepth": 2

Finally, when a page is seen as redirecting to another one, will it go through the rest of the topology for that page (I mean whatever is behind the fetcher) or not?


Solution

  • Outlinks and redirections are handled separately; see JSoupParserBolt.java#L341. Most redirections happen in the FetcherBolt, where the parser.emitOutlinks setting does not apply anyway.

    The target of the redirection will have a status of DISCOVERED unless it already exists with a different status.

    Bear in mind that redirected URLs go through filtering and normalisation just like any outlink, so something there could be preventing the URLs from being added, e.g. a filter on the hostname.

    Finally, when a page is seen as redirecting to another one, will it go through the rest of the topology for that page (I mean whatever is behind the fetcher) or not?

    No. A page that redirects is not passed on to the bolts behind the fetcher; see FetcherBolt.
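
To make the points above concrete, here is a minimal toy model in Java. It is not StormCrawler code: the class RedirectSketch, its fields, and the onFetched method are all hypothetical names made up for illustration, and the status enum is a simplified subset. It only sketches the behaviour described in the answer: the redirecting page is recorded but not passed on, and the target becomes DISCOVERED only if redirections are allowed, the filters keep it, and it is not already known with another status.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Toy model of the behaviour described above. This is NOT StormCrawler code:
// every class, field and method name here is hypothetical.
class RedirectSketch {

    // Simplified status set, named after the statuses mentioned in the answer.
    enum Status { DISCOVERED, FETCHED, REDIRECTION }

    // Stand-in for the status backend: URL -> current status.
    final Map<String, Status> backend = new HashMap<>();

    // URLs handed on to whatever sits behind the fetcher (parser, indexer, ...).
    final List<String> sentDownstream = new ArrayList<>();

    private final boolean redirectionsAllowed;
    private final Predicate<String> urlFilter; // stand-in for the URL filters

    RedirectSketch(boolean redirectionsAllowed, Predicate<String> urlFilter) {
        this.redirectionsAllowed = redirectionsAllowed;
        this.urlFilter = urlFilter;
    }

    // Simulates the fetcher receiving a response for url.
    // redirectTarget is null when the response is not a redirection.
    void onFetched(String url, String redirectTarget) {
        if (redirectTarget == null) {
            // Simplification: the page is marked FETCHED right away and passed on.
            backend.put(url, Status.FETCHED);
            sentDownstream.add(url);
            return;
        }
        // The redirecting page is only recorded; it is not passed on.
        backend.put(url, Status.REDIRECTION);
        // The target is added only if redirections are allowed and the filters keep it.
        if (redirectionsAllowed && urlFilter.test(redirectTarget)) {
            // An existing entry keeps its status; otherwise the target is DISCOVERED.
            backend.putIfAbsent(redirectTarget, Status.DISCOVERED);
        }
    }

    public static void main(String[] args) {
        // Hypothetical filter that rejects one hostname.
        RedirectSketch sketch =
                new RedirectSketch(true, u -> !u.contains("blocked.example"));
        sketch.onFetched("http://a.example/old", "http://a.example/new");
        sketch.onFetched("http://a.example/other", "http://blocked.example/x");
        System.out.println("backend: " + sketch.backend);
        System.out.println("sent downstream: " + sketch.sentDownstream);
    }
}
```

Note that nothing in this sketch looks at an outlink setting, which mirrors the point that parser.emitOutlinks plays no part in how the fetcher treats redirections. If a redirect target never shows up in your backend, the filter step is the first place to look.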