Tags: maven, apache-spark, hive, thrift

Hive thrift server built into Spark 3.0.0 is throwing the error "org.postgresql.Driver" was not found in the CLASSPATH


I am moving from Spark 2.3.2 with an external Hive server to Spark 3.0.0 with a built-in thrift Hive server; however, I am having trouble getting the thrift server bundled with Spark to find the PostgreSQL client libraries it needs to connect to an external metastore.

In Spark 2.3.3 I simply set the metastore options in $HIVE_HOME/conf/hive-site.xml, added the jars to $HIVE_HOME/lib, and everything worked. In Spark 3.0.0 I declared the location of the jars in $SPARK_HOME/conf/hive-site.xml like so:

<property>
  <name>spark.sql.hive.metastore.jars</name>
  <value>/usr/lib/postgresql</value>
  <description></description>
</property>

That did not work. I build Spark from source, so I also tried adding the PostgreSQL dependencies to the Spark pom.xml and the thriftserver pom.xml files so that Maven can pull the libraries in during the build. That didn't work either, so I'm stuck as to what to try next.
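For reference, the dependency block I added to the pom.xml files looked roughly like this (the version shown is only a placeholder for whichever PostgreSQL JDBC driver release is actually used):

<dependency>
  <!-- PostgreSQL JDBC driver; version is an example only -->
  <groupId>org.postgresql</groupId>
  <artifactId>postgresql</artifactId>
  <version>42.2.14</version>
</dependency>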

Here is the thrift server log showing the error:

20/06/23 03:54:54 INFO SharedState: loading hive config file: file:/opt/spark-3.0.0/conf/hive-site.xml
20/06/23 03:54:54 INFO SharedState: Setting hive.metastore.warehouse.dir ('null') to the value of spark.sql.warehouse.dir ('file:/opt/spark-3.0.0/spark-warehouse').
20/06/23 03:54:54 INFO SharedState: Warehouse path is 'file:/opt/spark-3.0.0/spark-warehouse'.
20/06/23 03:54:55 INFO HiveUtils: Initializing HiveMetastoreConnection version 2.3.7 using Spark classes.
20/06/23 03:54:55 INFO HiveConf: Found configuration file file:/opt/spark-3.0.0/conf/hive-site.xml
20/06/23 03:54:56 INFO SessionState: Created HDFS directory: /tmp/hive/ubuntu/c3454b03-bce5-4e0e-8a8b-3d7532470b3c
20/06/23 03:54:56 INFO SessionState: Created local directory: /tmp/ubuntu/c3454b03-bce5-4e0e-8a8b-3d7532470b3c
20/06/23 03:54:56 INFO SessionState: Created HDFS directory: /tmp/hive/ubuntu/c3454b03-bce5-4e0e-8a8b-3d7532470b3c/_tmp_space.db
20/06/23 03:54:56 INFO HiveClientImpl: Warehouse location for Hive client (version 2.3.7) is file:/opt/spark-3.0.0/spark-warehouse
20/06/23 03:54:57 WARN HiveConf: HiveConf of name hive.stats.jdbc.timeout does not exist
20/06/23 03:54:57 WARN HiveConf: HiveConf of name hive.stats.retries.wait does not exist
20/06/23 03:54:57 INFO HiveMetaStore: 0: Opening raw store with implementation class:org.apache.hadoop.hive.metastore.ObjectStore
20/06/23 03:54:57 INFO ObjectStore: ObjectStore, initialize called
20/06/23 03:54:57 INFO Persistence: Property hive.metastore.integral.jdo.pushdown unknown - will be ignored
20/06/23 03:54:57 INFO Persistence: Property datanucleus.cache.level2 unknown - will be ignored
20/06/23 03:54:57 ERROR Datastore: Exception thrown creating StoreManager. See the nested exception
Error creating transactional connection factory
org.datanucleus.exceptions.NucleusException: Error creating transactional connection factory
    at org.datanucleus.store.AbstractStoreManager.registerConnectionFactory(AbstractStoreManager.java:214)

....

Caused by: org.datanucleus.exceptions.NucleusException: Attempt to invoke the "BONECP" plugin to create a ConnectionPool gave an error : The specified datastore driver ("org.postgresql.Driver") was not found in the CLASSPATH. Please check your CLASSPATH specification, and the name of the driver.
        at org.datanucleus.store.rdbms.ConnectionFactoryImpl.generateDataSources(ConnectionFactoryImpl.java:232)
        at org.datanucleus.store.rdbms.ConnectionFactoryImpl.initialiseDataSources(ConnectionFactoryImpl.java:117)
        at org.datanucleus.store.rdbms.ConnectionFactoryImpl.<init>(ConnectionFactoryImpl.java:82)
        ... 100 more
Caused by: org.datanucleus.store.rdbms.connectionpool.DatastoreDriverNotFoundException: The specified datastore driver ("org.postgresql.Driver") was not found in the CLASSPATH. Please check your CLASSPATH specification, and the name of the driver.
        at org.datanucleus.store.rdbms.connectionpool.AbstractConnectionPoolFactory.loadDriver(AbstractConnectionPoolFactory.java:58)
        at org.datanucleus.store.rdbms.connectionpool.BoneCPConnectionPoolFactory.createConnectionPool(BoneCPConnectionPoolFactory.java:54)
        at org.datanucleus.store.rdbms.ConnectionFactoryImpl.generateDataSources(ConnectionFactoryImpl.java:213)

Here is my $SPARK_HOME/conf/hive-site.xml file

<configuration>

 <property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:postgresql://xxxxxxx:54321/db?sslmode=require</value>
  <description></description>
 </property>


 <property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>org.postgresql.Driver</value>
  <description></description>
 </property>

 <property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>xxxxx</value>
  <description>db user</description>
 </property>

 <property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>xxxxx</value>
  <description>db password</description>
 </property>

 <property>
  <name>hive.execution.engine</name>
  <value>spark</value>
 </property>

 <property>
  <name>spark.master</name>
  <value>spark://127.0.1.1:7077</value>
 </property>

 <property>
  <name>spark.eventLog.enabled</name>
  <value>true</value>
 </property>

 <property>
  <name>spark.eventLog.dir</name>
  <value>/var/log/spark</value>
 </property>

 <property>
  <name>spark.executor.memory</name>
  <value>2048m</value>
 </property>

 <property>
  <name>spark.serializer</name>
  <value>org.apache.spark.serializer.KryoSerializer</value>
 </property>

<property>
  <name>hive.server2.use.SSL</name>
  <value>false</value>
  <description></description>
</property>

<property>
  <name>spark.sql.hive.metastore.jars</name>
  <value>builtin</value>
  <description></description>
</property>

</configuration>
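One thing I am unsure about: the Spark SQL docs describe spark.sql.hive.metastore.jars as accepting "builtin", "maven", or a full classpath containing the Hive and Hadoop jars, so pointing it at just the PostgreSQL directory (as I tried above) is probably not enough on its own. A classpath-style value would look more like this (the paths below are only placeholders for wherever those jars actually live):

<property>
  <name>spark.sql.hive.metastore.jars</name>
  <!-- standard classpath format; placeholder paths -->
  <value>/opt/hive/lib/*:/opt/hadoop/share/hadoop/common/*:/usr/share/java/postgresql.jar</value>
</property>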

In .bashrc I have set the following environment variables:

export JAVA_HOME=/usr/lib/jvm/jdk1.8.0_251
export SPARK_HOME=/opt/spark
export PATH=$SPARK_HOME/bin:$SPARK_HOME/sbin:$PATH

When starting the thrift server I use the provided start-thriftserver.sh script.
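Since start-thriftserver.sh accepts the same command line options as spark-submit, my understanding is that the driver jar could also be supplied on the classpath at launch time, along these lines (the jar path is only a placeholder):

$SPARK_HOME/sbin/start-thriftserver.sh \
  --driver-class-path /usr/share/java/postgresql-42.2.14.jar \
  --jars /usr/share/java/postgresql-42.2.14.jar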


Solution

  • The problem with the Maven dependencies appears to be that the incremental Maven build does not pull in any new dependencies. An ugly fix is to force a fresh download and a complete rebuild, like so:

    ./build/mvn dependency:purge-local-repository -Pyarn -Dhadoop.version=2.8.5 -Phive -Phive-thriftserver -DskipTests clean package
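After the rebuild it is worth checking that the driver jar actually made it into the assembly, e.g. (assuming the default Scala 2.12 build; adjust the path to your source tree):

    ls assembly/target/scala-2.12/jars | grep -i postgresql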