Tags: hadoop, hbase, apache-zookeeper, nutch, nutch2

How can I connect Apache Nutch 2.x to a remote HBase cluster?


I have two machines. One runs HBase 0.92.2 in pseudo-distributed mode; the other runs the Nutch 2.x crawler. How can I configure them so that the HBase 0.92.2 machine acts as back-end storage and the Nutch 2.x machine acts as the crawler?


Solution

  • I finally got it working, and it was easy to do. I am sharing my experience here in case it helps someone.

    1- Edit hbase-site.xml to configure HBase for pseudo-distributed mode.

    2- MOST IMPORTANT: on the HBase machine, replace the localhost IP in /etc/hosts with the machine's real network IP, like this:

    10.11.22.189 master localhost

    Here 10.11.22.189 is the HBase machine's IP. (Note: if you don't change the localhost IP on the HBase machine, the remote Nutch crawler won't be able to connect to it.)

    4- Copy or symlink hbase-site.xml into $NUTCH_HOME/conf on the Nutch machine.

    5- Start your crawler and check that it works.
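
If the crawler still cannot connect, a few quick checks can narrow down which step went wrong. The sketch below is only an illustration, not part of the original answer: the helper names (loopback_names, site_property, can_connect) are made up for this example, and it assumes hbase-site.xml sets the standard hbase.zookeeper.quorum property and that ZooKeeper listens on its default client port, 2181. The usual explanation for step 2 is that HBase publishes its own hostname through ZooKeeper, so a name that resolves to 127.0.0.1 sends remote clients back to themselves.

```python
import ipaddress
import socket
import xml.etree.ElementTree as ET


def loopback_names(hosts_text):
    """Return hostnames that /etc/hosts-style text maps to a loopback address."""
    names = []
    for line in hosts_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        fields = line.split()
        try:
            addr = ipaddress.ip_address(fields[0])
        except ValueError:
            continue  # first field is not an IP address, skip the line
        if addr.is_loopback:
            names.extend(fields[1:])
    return names


def site_property(site_xml_text, name):
    """Return the value of a named property in hbase-site.xml text, or None."""
    root = ET.fromstring(site_xml_text)
    for prop in root.iter("property"):
        if prop.findtext("name") == name:
            return prop.findtext("value")
    return None


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # Sample data only; on real machines, read /etc/hosts and
    # $NUTCH_HOME/conf/hbase-site.xml instead.
    bad_hosts = "127.0.0.1 localhost master"
    good_hosts = "10.11.22.189 master localhost"
    print(loopback_names(bad_hosts))   # ['localhost', 'master']
    print(loopback_names(good_hosts))  # []

    sample_site = """
    <configuration>
      <property>
        <name>hbase.zookeeper.quorum</name>
        <value>master</value>
      </property>
    </configuration>
    """
    print(site_property(sample_site, "hbase.zookeeper.quorum"))  # master
```

On the HBase machine, loopback_names applied to the contents of /etc/hosts should not list the machine's own hostname. On the Nutch machine, something like can_connect("master", 2181) (with "master" replaced by whatever the quorum property names) should return True before the crawler is started.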