In the ES topology I would like to index URLs in Elasticsearch and forward a tuple of (url, [title, content]) to HDFS storage. I found that Apache Storm has a proper HDFS bolt, which looks like a straightforward implementation. I would like to know where to look for this tuple in the ES crawling topology. Could you point out which bolt has this data?
You'd need not only the textual content but also the metadata, as this is where the title gets stored. Look at what the JSoupParserBolt emits on the default stream and connect the HDFS bolt to its output.
This is similar to what we do with the WARC module, which extends the HDFS bolt, except that the WARC bolt does not require anything from the parsing step and can be connected straight to the output of the Fetcher.
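For the HDFS bolt connected to the parser output, the part you would have to write yourself is the record formatting. Here is a rough sketch of that logic in plain Java, with no Storm dependency so it can be run on its own. It makes several assumptions: the parser tuple carries the URL, the metadata and the extracted text; the title sits in the metadata under the key "parse.title" (this depends on your parse filter configuration); and a plain Map<String, String[]> stands in for the crawler's metadata object. The class and method names are made up for illustration. In a real topology this logic would go into a custom record format for the Storm HDFS bolt.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class HdfsRecordSketch {

    // Assumption: the title is stored under this metadata key.
    // Check your parse filter configuration for the actual key.
    static final String TITLE_KEY = "parse.title";

    // Builds one tab-separated line: url, title, text.
    // The map is a stand-in for the crawler's metadata object.
    static String toRecord(String url, Map<String, String[]> metadata, String text) {
        String[] titles = metadata.get(TITLE_KEY);
        String title = (titles == null || titles.length == 0) ? "" : titles[0];
        return clean(url) + "\t" + clean(title) + "\t" + clean(text);
    }

    // Collapses tabs and line breaks so each record stays on a single line.
    static String clean(String s) {
        return s == null ? "" : s.replaceAll("[\\t\\r\\n]+", " ").trim();
    }

    public static void main(String[] args) {
        Map<String, String[]> metadata = new LinkedHashMap<>();
        metadata.put(TITLE_KEY, new String[] {"Example Domain"});
        System.out.println(
            toRecord("http://example.com/", metadata, "Some text\nextracted by the parser"));
    }
}
```

Tab-separated lines are only one possible layout. If the text can be large or you need to keep line breaks, a sequence file or JSON lines may be a better fit.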