I want to crawl a particular forum in near real time and dump the data into HDFS, if not HBase.
I heard Apache Nutch could serve the purpose, but sadly the technology stack it requires is pretty old. I don't want to downgrade Hadoop from 2.6 to an earlier version, or Elasticsearch to 1.7/1.4, so I shifted my focus to storm-crawler.
Since I am using Hadoop 2.6, Elasticsearch 2.0 and HBase 1.1.3, can anyone tell me whether storm-crawler 0.9 can be used alongside them?
Since you have a particular requirement to crawl the forum in near real time, Nutch is not the best technology for the job. Nutch works in batches: links are generated, then fetched, then parsed, but never one link at a time. storm-crawler, on the other hand, is based on Apache Storm, which is a free and open source distributed realtime computation system.
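To make the difference concrete, here is a minimal sketch of what a storm-crawler topology looks like, loosely based on the example topology shipped with the project. The spout, bolt classes, field names and config key follow the 0.x API (`backtype.storm` / `com.digitalpebble.storm.crawler` packages) as I remember them, so treat them as assumptions and check them against the version you actually use:

```
import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.tuple.Fields;

import com.digitalpebble.storm.crawler.bolt.FetcherBolt;
import com.digitalpebble.storm.crawler.bolt.JSoupParserBolt;
import com.digitalpebble.storm.crawler.bolt.URLPartitionerBolt;
import com.digitalpebble.storm.crawler.spout.MemorySpout;

public class ForumCrawlTopology {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();

        // seed URL for the forum (hypothetical); in production you would use
        // a persistent spout instead of an in-memory one
        builder.setSpout("spout", new MemorySpout("http://forum.example.com/"));

        // partition URLs by host so politeness settings apply per host
        builder.setBolt("partitioner", new URLPartitionerBolt())
               .shuffleGrouping("spout");

        // pages are fetched as URL tuples arrive, one at a time: no batch cycles
        builder.setBolt("fetcher", new FetcherBolt())
               .fieldsGrouping("partitioner", new Fields("key"));

        // parse HTML and extract outlinks
        builder.setBolt("parser", new JSoupParserBolt())
               .localOrShuffleGrouping("fetcher");

        // a custom HBase indexer bolt would be attached here (see sketch below)

        Config conf = new Config();
        // the fetcher refuses to run without an agent name configured
        conf.put("http.agent.name", "forum-crawler");

        LocalCluster cluster = new LocalCluster();
        cluster.submitTopology("forum-crawler", conf, builder.createTopology());
    }
}
```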
storm-crawler supports indexing into Elasticsearch 1.7.2 at the moment (support for version 2 is on the way: https://github.com/DigitalPebble/storm-crawler/tree/es2/external/elasticsearch). There is no support for indexing into HBase yet, and you couldn't reuse your Hadoop setup, because storm-crawler is based on Apache Storm. Nevertheless, storm-crawler is "A collection of resources for building low-latency, scalable web crawlers", so you can write your own indexer bolt for HBase, which shouldn't be too hard, and reuse the rest of the provided resources, including the real-time crawling that you need.
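A custom HBase indexer bolt could look roughly like the sketch below. The table name (`webpage`), column family (`content`) and tuple field names (`url`, `content`) are assumptions for illustration; the HBase calls target the 1.1.x client API and the bolt class targets Storm 0.x (`backtype.storm` packages):

```
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichBolt;
import backtype.storm.tuple.Tuple;

public class HBaseIndexerBolt extends BaseRichBolt {

    private OutputCollector collector;
    private Connection connection;
    private Table table;

    @Override
    public void prepare(Map stormConf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
        try {
            Configuration conf = HBaseConfiguration.create();
            connection = ConnectionFactory.createConnection(conf);
            // "webpage" with a "content" column family must exist, e.g.
            // created beforehand in the HBase shell: create 'webpage', 'content'
            table = connection.getTable(TableName.valueOf("webpage"));
        } catch (Exception e) {
            throw new RuntimeException("Could not connect to HBase", e);
        }
    }

    @Override
    public void execute(Tuple tuple) {
        try {
            // use the URL as the row key and store the raw page content
            String url = tuple.getStringByField("url");
            byte[] content = tuple.getBinaryByField("content");
            Put put = new Put(Bytes.toBytes(url));
            put.addColumn(Bytes.toBytes("content"), Bytes.toBytes("raw"), content);
            table.put(put);
            collector.ack(tuple);
        } catch (Exception e) {
            collector.fail(tuple);
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // terminal bolt: nothing is emitted downstream
    }

    @Override
    public void cleanup() {
        try {
            table.close();
            connection.close();
        } catch (Exception e) {
            // best effort on shutdown
        }
    }
}
```

You would then attach it to the parser output in the topology, e.g. `builder.setBolt("hbase", new HBaseIndexerBolt()).localOrShuffleGrouping("parser");`, and keep the rest of the provided resources as they are.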