I am moving from Spark 2.3.2 with an external Hive server to Spark 3.0.0 with its built-in Hive thriftserver; however, I am having trouble getting the thriftserver bundled with Spark to find the PostgreSQL client libraries it needs to connect to an external metastore.
In Spark 2.3.3 I simply set the metastore options in $HIVE_HOME/conf/hive-site.xml, added the jars to $HIVE_HOME/lib, and everything worked. In Spark 3.0.0 I declared the location of the jars in $SPARK_HOME/conf/hive-site.xml like so:
<property>
  <name>spark.sql.hive.metastore.jars</name>
  <value>/usr/lib/postgresql</value>
  <description></description>
</property>
but that did not work. I build Spark from source, so I also tried adding the PostgreSQL dependency to the main Spark pom.xml and the thriftserver pom.xml files so that Maven would pull the libraries in at compile time. That didn't work either, so I'm stuck as to what to try next.
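For reference, the dependency I added to the pom.xml files looked roughly like the following sketch; the version shown is just an illustrative release of the PostgreSQL JDBC driver, not necessarily the exact one I used:

  <!-- PostgreSQL JDBC driver, so that org.postgresql.Driver is available on the classpath -->
  <dependency>
    <groupId>org.postgresql</groupId>
    <artifactId>postgresql</artifactId>
    <version>42.2.14</version>
  </dependency>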
Here is the thriftserver log showing the error:
20/06/23 03:54:54 INFO SharedState: loading hive config file: file:/opt/spark-3.0.0/conf/hive-site.xml
20/06/23 03:54:54 INFO SharedState: Setting hive.metastore.warehouse.dir ('null') to the value of spark.sql.warehouse.dir ('file:/opt/spark-3.0.0/spark-warehouse').
20/06/23 03:54:54 INFO SharedState: Warehouse path is 'file:/opt/spark-3.0.0/spark-warehouse'.
20/06/23 03:54:55 INFO HiveUtils: Initializing HiveMetastoreConnection version 2.3.7 using Spark classes.
20/06/23 03:54:55 INFO HiveConf: Found configuration file file:/opt/spark-3.0.0/conf/hive-site.xml
20/06/23 03:54:56 INFO SessionState: Created HDFS directory: /tmp/hive/ubuntu/c3454b03-bce5-4e0e-8a8b-3d7532470b3c
20/06/23 03:54:56 INFO SessionState: Created local directory: /tmp/ubuntu/c3454b03-bce5-4e0e-8a8b-3d7532470b3c
20/06/23 03:54:56 INFO SessionState: Created HDFS directory: /tmp/hive/ubuntu/c3454b03-bce5-4e0e-8a8b-3d7532470b3c/_tmp_space.db
20/06/23 03:54:56 INFO HiveClientImpl: Warehouse location for Hive client (version 2.3.7) is file:/opt/spark-3.0.0/spark-warehouse
20/06/23 03:54:57 WARN HiveConf: HiveConf of name hive.stats.jdbc.timeout does not exist
20/06/23 03:54:57 WARN HiveConf: HiveConf of name hive.stats.retries.wait does not exist
20/06/23 03:54:57 INFO HiveMetaStore: 0: Opening raw store with implementation class:org.apache.hadoop.hive.metastore.ObjectStore
20/06/23 03:54:57 INFO ObjectStore: ObjectStore, initialize called
20/06/23 03:54:57 INFO Persistence: Property hive.metastore.integral.jdo.pushdown unknown - will be ignored
20/06/23 03:54:57 INFO Persistence: Property datanucleus.cache.level2 unknown - will be ignored
20/06/23 03:54:57 ERROR Datastore: Exception thrown creating StoreManager. See the nested exception
Error creating transactional connection factory
org.datanucleus.exceptions.NucleusException: Error creating transactional connection factory
at org.datanucleus.store.AbstractStoreManager.registerConnectionFactory(AbstractStoreManager.java:214)
....
Caused by: org.datanucleus.exceptions.NucleusException: Attempt to invoke the "BONECP" plugin to create a ConnectionPool gave an error : The specified datastore driver ("org.postgresql.Driver") was not found in the CLASSPATH. Please check your CLASSPATH specification, and the name of the driver.
at org.datanucleus.store.rdbms.ConnectionFactoryImpl.generateDataSources(ConnectionFactoryImpl.java:232)
at org.datanucleus.store.rdbms.ConnectionFactoryImpl.initialiseDataSources(ConnectionFactoryImpl.java:117)
at org.datanucleus.store.rdbms.ConnectionFactoryImpl.<init>(ConnectionFactoryImpl.java:82)
... 100 more
Caused by: org.datanucleus.store.rdbms.connectionpool.DatastoreDriverNotFoundException: The specified datastore driver ("org.postgresql.Driver") was not found in the CLASSPATH. Please check your CLASSPATH specification, and the name of the driver.
at org.datanucleus.store.rdbms.connectionpool.AbstractConnectionPoolFactory.loadDriver(AbstractConnectionPoolFactory.java:58)
at org.datanucleus.store.rdbms.connectionpool.BoneCPConnectionPoolFactory.createConnectionPool(BoneCPConnectionPoolFactory.java:54)
at org.datanucleus.store.rdbms.ConnectionFactoryImpl.generateDataSources(ConnectionFactoryImpl.java:213)
Here is my $SPARK_HOME/conf/hive-site.xml file:
<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:postgresql://xxxxxxx:54321/db?sslmode=require</value>
    <description></description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>org.postgresql.Driver</value>
    <description></description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>xxxxx</value>
    <description>db user</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>xxxxx</value>
    <description>db password</description>
  </property>
  <property>
    <name>hive.execution.engine</name>
    <value>spark</value>
  </property>
  <property>
    <name>spark.master</name>
    <value>spark://127.0.1.1:7077</value>
  </property>
  <property>
    <name>spark.eventLog.enabled</name>
    <value>true</value>
  </property>
  <property>
    <name>spark.eventLog.dir</name>
    <value>/var/log/spark</value>
  </property>
  <property>
    <name>spark.executor.memory</name>
    <value>2048m</value>
  </property>
  <property>
    <name>spark.serializer</name>
    <value>org.apache.spark.serializer.KryoSerializer</value>
  </property>
  <property>
    <name>hive.server2.use.SSL</name>
    <value>false</value>
    <description></description>
  </property>
  <property>
    <name>spark.sql.hive.metastore.jars</name>
    <value>builtin</value>
    <description></description>
  </property>
</configuration>
In .bashrc I have set the following environment variables:
export JAVA_HOME=/usr/lib/jvm/jdk1.8.0_251
export SPARK_HOME=/opt/spark
export PATH=$SPARK_HOME/bin:$SPARK_HOME/sbin:$PATH
When starting the thriftserver I use the provided start-thriftserver.sh script.
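Since start-thriftserver.sh accepts the same command-line options as spark-submit, another route would presumably be to hand the driver jar to the server explicitly at startup. A minimal sketch (the jar path here is hypothetical):

  # hypothetical location of the PostgreSQL JDBC driver jar
  $SPARK_HOME/sbin/start-thriftserver.sh \
    --jars /path/to/postgresql-42.2.14.jar \
    --driver-class-path /path/to/postgresql-42.2.14.jar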
The problem with the Maven dependencies appears to be that the incremental Maven build does not pull in any new dependencies. An ugly fix is to force a fresh download and a complete rebuild, like so:
./build/mvn dependency:purge-local-repository -Pyarn -Dhadoop.version=2.8.5 -Phive -Phive-thriftserver -DskipTests clean package