Search code examples
hadoopjdbchiveemrapache-drill

Connect to Hive on EMR using Apache Drill Embedded


I am trying to experiment on Apache Drill 1.4 in Embedded mode and trying to connect to Hive running on EMR - Drill is running on server outside EMR.

I have some basic questions that I want to get clarified and some configuration issues to be fixed.

Here is what I have so far -

Running AWS EMR cluster. Running Drill Embedded server.

According to the documentation on configuring storage plugin for Hive, https://drill.apache.org/docs/hive-storage-plugin/ , I am getting confused on whether or not to use Remote Metastore or Embedded Metastore. What is the difference?

Next, my EMR cluster is running and here is what hive-site.xml looks like -

<property>
  <name>hive.metastore.uris</name>
  <value>thrift://ec2-XX-XX-XX-XX.compute-1.amazonaws.com:9083</value>
  <description>JDBC connect string for a JDBC metastore</description>
</property>

<property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://ec2-XX-XX-XX-XX.compute-1.amazonaws.com:3306/hive?createDatabaseIfNotExist=true</value>
    <description>username to use against metastore database</description>
</property>

There are other properties defined like MySQL username and password etc. but I guess these are important here.

Which one should I use to connect to Hive? I have tried to put both these in the storage plugin but Drill doesnt take it.

Storage plugins I have tried look like this -

{
  "type": "hive",
  "enabled": true,
  "configProps": {
    "hive.metastore.uris": "thrift://ec2-XX-XX-XX-XX.compute-1.amazonaws.com:9083",
    "fs.default.name": "hdfs://ec2-XX-XX-XX-XX.compute-1.amazonaws.com/",
    "hive.metastore.sasl.enabled": "false"
  }
}

and

{
  "type": "hive",
  "enabled": true,
  "configProps": {
    "hive.metastore.uris": "thrift://ec2-XX-XX-XX-XX.compute-1.amazonaws.com:9083",
    "javax.jdo.option.ConnectionURL": "jdbc:derby:ec2-XX-XX-XX-XX.compute-1.amazonaws.com;databaseName=data;create=true",
    "hive.metastore.warehouse.dir": "/user/hive/warehouse",
    "fs.default.name": "file:///",
    "hive.metastore.sasl.enabled": "false"
  }
}

It would be of great help if you could guide me in setting this up. Thanks!


Solution

  • I was facing several problems -

    1. VPC issue - my EMR cluster and mysql host were in different VPCs. Trivial.
    2. Mysql connection was not happening from EMR cluster to mysql host - binding was strict to localhost. Removed it.
    3. Now when I restarted hive --service metastore, I saw the error that driver name is not correct and driver class com.mysql.jdbc.Driver not found - so I had to download MySQL Connector driver as instructed in Step 2 here.
    4. After MySql could connect, metastore could connect to the database : error was mysql Database initialization failed; direct SQL is disabled, but initial tables need to be present. So the table creation had to be done with a command here - Getting MissingTableException: Required table missing VERSION when starting hive on mysql

      Go to the $HIVE_HOME and run the initschema option on the schematool:

      bin/schematool -dbType mysql -initSchema

      Make sure you have cleaned up the mysql database on which you are moving this metastore. No tables or schema or tables are present that Hive needs.

    5. After these, metastore was able to connect to external database. Now Hive is up and running with remote metastore.

    Now I hosted Drill (embedded) in new EC2 host to connect to this metastore and it worked like a charm!

    curl -X POST -H "Content-Type: application/json" -d '{"name":"hive", "config": {   "type": "hive",   "enabled": true,  "configProps": {    "hive.metastore.uris":"thrift://ip-XX.XX.XX.XX.ec2.internal:9083",    "javax.jdo.option.ConnectionURL":"jdbc:mysql://ip-XX.XX.XX.XX:3306/hive?createDatabaseIfNotExist=true",    "javax.jdo.option.ConnectionDriverName":"com.mysql.jdbc.Driver",    "javax.jdo.option.ConnectionUserName":"root",    "javax.jdo.option.ConnectionPassword":"blah",    "hive.metastore.warehouse.dir":"/user/hive/warehouse",    "fs.default.name":"hdfs://ip-XX.XX.XX.XX.ec2.internal:8020"  }}}' http://localhost:8047/storage/hive.json