Tags: apache-spark, hive

Running Spark with built-in Hive and configuring a remote PostgreSQL database for the Hive Metastore


I am running Spark v1.0.1 with built-in Hive (Spark built with SPARK_HIVE=true sbt/sbt assembly/assembly).

I also configured Hive to store its Metastore in a PostgreSQL database, following these instructions:

http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH4/4.2.0/CDH4-Installation-Guide/cdh4ig_topic_18_4.html

I was able to configure standalone Hive (not the one built into Spark) to use PostgreSQL, but I don't know how to get it working with the Hive built into Spark.
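
For context, the metastore connection is configured in hive-site.xml. A minimal sketch of the relevant properties, with placeholder host, database name, and credentials (adjust to your own setup):

<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:postgresql://metastorehost:5432/metastore</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>org.postgresql.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hiveuser</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>mypassword</value>
</property>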

In the instructions, I see that I need to copy or symlink postgresql-jdbc.jar into hive/lib so that Hive picks up the PostgreSQL JDBC driver when it runs:

$ sudo yum install postgresql-jdbc
$ ln -s /usr/share/java/postgresql-jdbc.jar /usr/lib/hive/lib/postgresql-jdbc.jar

With the Hive built into Spark, where should I put postgresql-jdbc.jar to get it to work?


Solution

  • I found the solution to my problem: I needed to add postgresql-jdbc4.jar to Spark's classpath so that the built-in Hive can load the PostgreSQL JDBC driver.

    I added three environment variables:

    export CLASSPATH="$CLASSPATH:/usr/share/java/postgresql-jdbc4.jar"
    export SPARK_CLASSPATH=$CLASSPATH
    export SPARK_SUBMIT_CLASSPATH=$CLASSPATH
    

    SPARK_CLASSPATH is used by spark-shell.

    SPARK_SUBMIT_CLASSPATH is used by spark-submit (I am not sure about this).
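
    To make these settings persistent, one option (an assumption on my part, not something from the original instructions) is to add the exports to Spark's conf/spark-env.sh, which the Spark launch scripts source on startup:

    # conf/spark-env.sh
    # Path assumes the jar installed by the postgresql-jdbc package.
    export SPARK_CLASSPATH="$SPARK_CLASSPATH:/usr/share/java/postgresql-jdbc4.jar"
    export SPARK_SUBMIT_CLASSPATH="$SPARK_SUBMIT_CLASSPATH:/usr/share/java/postgresql-jdbc4.jar"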

    Now I can use spark-shell with the built-in Hive, configured to store its Metastore in PostgreSQL.
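
    To verify that the built-in Hive really talks to the PostgreSQL metastore, a quick sketch from spark-shell (Spark 1.0.x API, where HiveContext exposes hql; it was deprecated in later releases):

    val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
    // SHOW TABLES forces a round trip to the metastore, so this
    // fails fast if the PostgreSQL connection is misconfigured.
    hiveContext.hql("SHOW TABLES").collect().foreach(println)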