Tags: hadoop, hive, hdfs, sqoop, production-environment

Production Level Hive and Sqoop Configuration


I have some questions regarding Hive configuration at a production level. If I have an HDFS setup running remotely:

  1. Where would I have to install Hive so that I can run HQL queries against the data in HDFS? What configurations need to be made in Hive?

  2. Where would the metastore database be located?

  3. If I want to install Sqoop so that it can extract data from a local RDBMS to the remote HDFS, where should it be installed?

Solution

  • Hive Server should be installed on a master node alongside the HDFS NameNode and Secondary NameNode (see this sample topology: http://pivotalhd.docs.pivotal.io/docs/01-RawContent/Getting-Started/PHD2_Typical_Cluster_Topology.png). You will also need to install YARN, since Hive executes its queries as jobs on the cluster.
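    As for the metastore question: in a typical production setup the metastore database is an external RDBMS (commonly MySQL or PostgreSQL) reachable from the Hive Server node, and clients talk to the metastore service over Thrift. A minimal hive-site.xml sketch, where all hostnames and credentials are hypothetical placeholders:

    ```xml
    <!-- hive-site.xml: point the metastore at an external MySQL database.
         Hostnames, ports, and credentials below are illustrative placeholders. -->
    <configuration>
      <property>
        <name>javax.jdo.option.ConnectionURL</name>
        <value>jdbc:mysql://metastore-db.example.com:3306/hive_metastore</value>
      </property>
      <property>
        <name>javax.jdo.option.ConnectionDriverName</name>
        <value>com.mysql.jdbc.Driver</value>
      </property>
      <property>
        <name>javax.jdo.option.ConnectionUserName</name>
        <value>hive</value>
      </property>
      <property>
        <name>javax.jdo.option.ConnectionPassword</name>
        <value>hive_password</value>
      </property>
      <property>
        <!-- Clients connect to the metastore service, not the database directly -->
        <name>hive.metastore.uris</name>
        <value>thrift://hive-master.example.com:9083</value>
      </property>
    </configuration>
    ```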

    Sqoop is usually installed on a client (edge) node.
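    To make the Sqoop part concrete, a job run from the edge node might look like the sketch below. The hostnames, database, table, and paths are hypothetical placeholders, not values from your environment:

    ```shell
    # Run from the edge node: import one table from a local MySQL instance
    # into the remote HDFS. All names below are illustrative placeholders.
    sqoop import \
      --connect jdbc:mysql://local-rdbms.example.com:3306/sales_db \
      --username sqoop_user \
      --password-file /user/sqoop_user/.db_password \
      --table orders \
      --target-dir /data/sales/orders \
      --num-mappers 4
    ```

    Using `--password-file` (an HDFS file readable only by the job user) avoids putting the database password on the command line, which matters in a shared production environment.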

    If you use a distribution like Hortonworks or Cloudera, they include a cluster manager (Ambari and Cloudera Manager, respectively) with wizards that ease deployment of all services such as Hive, YARN, HBase, etc.