apache-spark, hive, google-cloud-dataproc, hive-metastore

Create a Google Dataproc cluster and connect to an external remote Hive metastore


I am trying to create a Dataproc cluster that points to a remote Hive metastore so that the Hive tables are accessible from the cluster. I am using the create-cluster command below to spin up a Dataproc 2.0 cluster with a connection to the remote Hive metastore:

Create Cluster Command:

gcloud dataproc clusters create wl1-cluster-1 \
--region us-east1 \
--subnet projects/shared-vpc-admin/regions/us-east1/subnetworks/dev-us-east1-01 \
--enable-component-gateway \
--no-address \
--scopes 'https://www.googleapis.com/auth/cloud-platform' \
--master-machine-type n1-standard-32 \
--master-boot-disk-size 1000 \
--num-workers 10 \
--worker-machine-type n1-standard-32 \
--worker-boot-disk-size 1000 \
--image-version 2.0-debian10 \
--properties 'spark:spark.sql.hive.metastore.version=2.3.0,spark:spark.sql.hive.metastore.jars=maven,hive:hive.metastore.schema.verification=false,hive:javax.jdo.option.ConnectionURL=jdbc:mysql://test-mysql.gcp-dev.glb.us.mycompany.net:3306/metastore,hive:javax.jdo.option.ConnectionUserName=metastore,hive:javax.jdo.option.ConnectionPassword=XXXX' \
--project sample_gcp_project

But cluster creation fails during the activate-component-hive-metastore step.

Error Message:

<13>Sep 10 05:47:06 google-dataproc-startup[1734]: <13>Sep 10 05:47:06 activate-component-hive-metastore[3064]: nc: connect to wl1-cluster-1-m port 9083 (tcp) failed: Connection refused
<13>Sep 10 05:47:07 google-dataproc-startup[1734]: <13>Sep 10 05:47:07 activate-component-hive-metastore[3064]: nc: connect to wl1-cluster-1-m port 9083 (tcp) failed: Connection refused
<13>Sep 10 05:47:07 google-dataproc-startup[1734]: <13>Sep 10 05:47:07 activate-component-hive-metastore[3064]: 'nc -v -z -w 1 wl1-cluster-1-m 9083' attempt 23 failed! Sleeping 1s.
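
The log shows the startup script probing the local metastore port with nc. The same kind of probe, run from a VM in the cluster's subnet, is a quick way to check whether the remote MySQL host is reachable at all (hostname and port taken from the create command above):

# Run from a VM in the same subnet as the cluster;
# a refused or timed-out connection points to firewall, DNS, or VPC routing
nc -vz test-mysql.gcp-dev.glb.us.mycompany.net 3306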

A few questions:

  1. Is this the correct (and only) way to connect to a remote Hive metastore while creating a cluster?
  2. Instead of configuring the cluster with a connection to the remote Hive metastore, can I access the Hive metastore from a Spark job running on this cluster? Can you please point me in the right direction?
  3. Is there any other approach to solve this problem?

Solution

  • By default, a Hive Metastore service runs on the first master node of a Dataproc cluster, listening on port 9083, and HiveServer2 is configured to use it. In /etc/hive/conf/hive-site.xml, you will find:

      <property>
        <name>hive.metastore.uris</name>
        <value>thrift://<master-hostname>:9083</value>
      </property>
    
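    To verify this on a running cluster, SSH into the master node and check both the configured URI and that the Thrift service is actually listening (a quick sketch using the default Dataproc paths):

    # Show the metastore URI that HiveServer2 is using
    grep -A 1 'hive.metastore.uris' /etc/hive/conf/hive-site.xml
    # Confirm the Thrift metastore is listening on port 9083
    nc -vz "$(hostname)" 9083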

    If you need the HiveServer2 of your cluster:

    1. to use the Metastore of another cluster, set --properties hive:hive.metastore.uris=thrift://<another-cluster-master-hostname>:9083 when creating it (a full example follows this list).

    2. or to use a managed Dataproc Metastore service, attach the service when creating the cluster, for example:

    gcloud dataproc clusters create example-cluster \
        --dataproc-metastore=projects/PROJECT_ID/locations/LOCATION/services/example-service \
        --region=LOCATION
    
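    For option 1, a full create command might look like this (a sketch; wl1-cluster-2 and other-cluster-m are placeholder names you would substitute):

    gcloud dataproc clusters create wl1-cluster-2 \
        --region=us-east1 \
        --image-version=2.0-debian10 \
        --properties='hive:hive.metastore.uris=thrift://other-cluster-m:9083' \
        --project=sample_gcp_project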

    The clusters and the Metastore need to be in the same VPC network so they can communicate with each other through internal IPs (or hostnames).

    Spark in the cluster is automatically configured to use the Hive config hive.metastore.uris to find the Metastore, so you don't need any additional configuration in your Spark code.
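
    This also addresses question 2: instead of wiring the metastore into the cluster, you can point an individual Spark job at a remote Thrift metastore at submission time, since spark.hadoop.* properties are passed through to the Hadoop/Hive configuration that the metastore client reads. A sketch, where gs://my-bucket/query_hive.py and metastore-host are hypothetical placeholders:

    # Submit a PySpark job that overrides the metastore URI for this job only
    gcloud dataproc jobs submit pyspark gs://my-bucket/query_hive.py \
        --cluster=wl1-cluster-1 \
        --region=us-east1 \
        --properties='spark.hadoop.hive.metastore.uris=thrift://metastore-host:9083'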