apache-spark, hive, hadoop2

Hive on Spark ERROR java.lang.NoSuchFieldError: SPARK_RPC_SERVER_ADDRESS


With Hive on Spark, a simple select * from table query runs smoothly, but joins and sums make the ApplicationMaster return this stack trace for the associated Spark container:

2019-03-29 17:23:43 ERROR ApplicationMaster:91 - User class threw exception: java.lang.NoSuchFieldError: SPARK_RPC_SERVER_ADDRESS
java.lang.NoSuchFieldError: SPARK_RPC_SERVER_ADDRESS
    at org.apache.hive.spark.client.rpc.RpcConfiguration.<clinit>(RpcConfiguration.java:47)
    at org.apache.hive.spark.client.RemoteDriver.<init>(RemoteDriver.java:134)
    at org.apache.hive.spark.client.RemoteDriver.main(RemoteDriver.java:516)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$4.run(ApplicationMaster.scala:706)
2019-03-29 17:23:43 INFO  ApplicationMaster:54 - Final app status: FAILED, exitCode: 13, (reason: User class threw exception: java.lang.NoSuchFieldError: SPARK_RPC_SERVER_ADDRESS
    at org.apache.hive.spark.client.rpc.RpcConfiguration.<clinit>(RpcConfiguration.java:47)
    at org.apache.hive.spark.client.RemoteDriver.<init>(RemoteDriver.java:134)
    at org.apache.hive.spark.client.RemoteDriver.main(RemoteDriver.java:516)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$4.run(ApplicationMaster.scala:706)
)
2019-03-29 17:23:43 ERROR ApplicationMaster:91 - Uncaught exception: 
org.apache.spark.SparkException: Exception thrown in awaitResult: 
    at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:205)
    at org.apache.spark.deploy.yarn.ApplicationMaster.runDriver(ApplicationMaster.scala:486)
    at org.apache.spark.deploy.yarn.ApplicationMaster.org$apache$spark$deploy$yarn$ApplicationMaster$$runImpl(ApplicationMaster.scala:345)
    at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$2.apply$mcV$sp(ApplicationMaster.scala:260)
    at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$2.apply(ApplicationMaster.scala:260)
    at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$2.apply(ApplicationMaster.scala:260)
    at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$5.run(ApplicationMaster.scala:800)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
    at org.apache.spark.deploy.yarn.ApplicationMaster.doAsUser(ApplicationMaster.scala:799)
    at org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:259)
    at org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:824)
    at org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala)
Caused by: java.util.concurrent.ExecutionException: Boxed Error
    at scala.concurrent.impl.Promise$.resolver(Promise.scala:55)
    at scala.concurrent.impl.Promise$.scala$concurrent$impl$Promise$$resolveTry(Promise.scala:47)
    at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:244)
    at scala.concurrent.Promise$class.tryFailure(Promise.scala:112)
    at scala.concurrent.impl.Promise$DefaultPromise.tryFailure(Promise.scala:153)
    at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$4.run(ApplicationMaster.scala:724)
Caused by: java.lang.NoSuchFieldError: SPARK_RPC_SERVER_ADDRESS
    at org.apache.hive.spark.client.rpc.RpcConfiguration.<clinit>(RpcConfiguration.java:47)
    at org.apache.hive.spark.client.RemoteDriver.<init>(RemoteDriver.java:134)
    at org.apache.hive.spark.client.RemoteDriver.main(RemoteDriver.java:516)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$4.run(ApplicationMaster.scala:706)
2019-03-29 17:23:43 INFO  ApplicationMaster:54 - Deleting staging directory hdfs://LOSLDAP01:9000/user/hdfs/.sparkStaging/application_1553880018684_0001
2019-03-29 17:23:43 INFO  ShutdownHookManager:54 - Shutdown hook called

I have already tried increasing the YARN container memory allocation (and decreasing the Spark memory), with no success.

Using: Hadoop 2.9.2, Spark 2.3.0, Hive 2.3.4

Thank you for your help.


Solution

  • It turned out that Hive-on-Spark has a lot of implementation problems and essentially does not work at all unless you write your own custom Hive connector. In a nutshell, the Spark developers are struggling to keep up with Hive releases and have not yet decided how to handle backward compatibility for loading Hive versions ~< 2 while focusing on the newest branch.


    Solutions

    1) Go back to Hive 1.x

    Not ideal, especially if you want more modern integration with file formats such as ORC.

    2) Use Hive-on-Tez

    This is the one we decided to adopt. *This solution does not break the open source stack* and works perfectly alongside Spark-on-Yarn. Third-party Hadoop ecosystems, like those of Azure, AWS and Hortonworks, all add proprietary code just for running Hive-on-Spark because of the mess it has become.

    Once Tez is installed, your Hadoop queries will work like this:

    • A direct Hive query (e.g. a JDBC connection from DBeaver) will run in a Tez container on the cluster
    • A Spark job will be able to access the Hive metastore as usual and will use a Spark container on the cluster when you create the session with SparkSession.builder.enableHiveSupport().getOrCreate() (this is PySpark code; see the sketch below)
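
    As an illustration, here is a minimal PySpark sketch of that second case. The database and table names (myDb.myTable) are placeholders, not part of the original setup:

        from pyspark.sql import SparkSession

        # Enabling Hive support makes the Hive metastore (configured via the
        # hive-site.xml copied into $SPARK_HOME/conf/) visible to Spark.
        spark = (
            SparkSession.builder
            .appName("hive-metastore-check")
            .enableHiveSupport()
            .getOrCreate()
        )

        # These queries run in Spark containers on YARN, not in a Tez container,
        # because Spark uses its own execution engine.
        spark.sql("show databases").show()
        spark.sql("select count(*) from myDb.myTable").show()

        spark.stop()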

    Installing Hive-on-Tez with Spark-on-Yarn

    Note: I'll keep it short since I do not see much interest on these boards. Ask for details and I'll be happy to help and expand.

    Version matrix

    Hadoop   2.9.2
    Tez      0.9.2
    Hive     2.3.4
    Spark    2.4.2
    

    Hadoop is installed in cluster mode.

    This is what worked for us. I would not expect it to work seamlessly when switching to Hadoop 3.x, which we will be doing at some point in the future, but it should work fine as long as you do not change the major release version of any component.

    Basic guide

    1. Compile Tez from source as described in the official install guide, using Mode A for sharing the Hadoop jars. Do not use any pre-compiled Tez distro. Test it from the Hive shell with a query that is more than simple data access (i.e. not just a plain select), for example: select count(*) from myDb.myTable. You should see the Tez progress bars in the Hive console.
    2. Compile Spark from source. To do so, follow the official guide (important: download the archive labeled without-hadoop!), but before compiling it, edit the source at ./sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveUtils.scala and comment out the following line: ConfVars.HIVE_STATS_JDBC_TIMEOUT -> TimeUnit.SECONDS,
    3. Copy $HIVE_HOME/conf/hive-site.xml into your $SPARK_HOME/conf/ dir. You must make a hard copy of this config file, not a symlink, because you must remove all Tez-related Hive config values from it to guarantee that Spark co-exists independently with Tez, as explained above. This includes the hive.execution.engine=tez property, which must not be set for Spark: remove it completely from Spark's hive-site.xml while leaving it in Hive's hive-site.xml (see the first sketch after this list).
    4. In $HADOOP_HOME/etc/hadoop/mapred-site.xml, set the property mapreduce.framework.name=yarn. This will be picked up correctly by both environments even though it is not set to yarn-tez; it just means that raw MapReduce jobs will not run on Tez, while Hive jobs will indeed use it. This only matters for legacy jobs, since raw mapred is obsolete (see the second sketch after this list).
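
    To make step 3 concrete, this is the kind of block to delete from Spark's copy of hive-site.xml (a sketch; your actual file will contain many other properties):

        <!-- Keep this in $HIVE_HOME/conf/hive-site.xml, but remove it from the
             copy in $SPARK_HOME/conf/hive-site.xml so that Spark never tries
             to use the Tez engine. -->
        <property>
          <name>hive.execution.engine</name>
          <value>tez</value>
        </property>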
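
    And for step 4, the mapred-site.xml entry looks like this (a minimal sketch, assuming an otherwise standard $HADOOP_HOME/etc/hadoop/mapred-site.xml):

        <!-- Raw MapReduce jobs run on plain YARN; Hive picks up Tez from its
             own hive-site.xml, so yarn-tez is not required here. -->
        <property>
          <name>mapreduce.framework.name</name>
          <value>yarn</value>
        </property>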

    Good luck!