Tags: python, hive, pyspark, hortonworks-data-platform

PySpark cannot reach Hive


In short: I have a working Hive on HDP 3, which I cannot reach from PySpark running under YARN (on the same HDP cluster). How do I get PySpark to find my tables?

spark.catalog.listDatabases() only shows default, and no query I run shows up in my Hive logs.

This is my code, with Spark 2.3.1:

from pyspark.sql import SparkSession
from pyspark.conf import SparkConf
settings = []
conf = SparkConf().setAppName("Guillaume is here").setAll(settings)
spark = (
    SparkSession
    .builder
    .master('yarn')
    .config(conf=conf)
    .enableHiveSupport()
    .getOrCreate()
)
print(spark.catalog.listDatabases())

Note that settings is empty. I thought this would be sufficient, because in the logs I see

loading hive config file: file:/etc/spark2/3.0.1.0-187/0/hive-site.xml

and more interestingly

Registering function intersectgroups io.x.x.IntersectGroups

This is a UDF I wrote and added to Hive manually, so some sort of connection is clearly being made.

The only output I get (apart from the logs) is:

[ Database(name=u'default', description=u'default database', locationUri=u'hdfs://HdfsNameService/apps/spark/warehouse')]

I understand that I should set spark.sql.warehouse.dir in settings. Whether I set it to the value I find in hive-site.xml, to the path of the database I am interested in (it is not in the default location), or to its parent, nothing changes.

I have put many other config options in settings (including the Thrift URIs), with no change.
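For reference, settings is what gets passed to SparkConf.setAll(), which takes key/value pairs; the entries I tried had roughly this shape (the values here are placeholders, not my real cluster addresses):

# Shape of the entries tried in `settings` (placeholder values only)
settings = [
    ("spark.sql.warehouse.dir", "/warehouse/tablespace/managed/hive"),  # placeholder path
    ("hive.metastore.uris", "thrift://metastore-host:9083"),            # placeholder URI
]
conf = SparkConf().setAppName("Guillaume is here").setAll(settings)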

I have also seen that I should copy hive-site.xml into the spark2 conf directory. I did that on all nodes of my cluster, with no change.

The command I use to run it is:

HDP_VERSION=3.0.1.0-187 PYTHONPATH=.:/usr/hdp/current/spark2-client/python/:/usr/hdp/current/spark2-client/python/lib/py4j-0.10.7-src.zip SPARK_HOME=/usr/hdp/current/spark2-client HADOOP_USER_NAME=hive spark-submit --master yarn --jars /usr/hdp/current/hive_warehouse_connector/hive-warehouse-connector-assembly-1.0.0.3.0.1.0-187.jar --py-files /usr/hdp/current/hive_warehouse_connector/pyspark_hwc-1.0.0.3.0.1.0-187.zip --files /etc/hive/conf/hive-site.xml ./subjanal/anal.py


Solution

  • In HDP 3.x, you need to use the Hive Warehouse Connector (HWC) to access Hive from Spark, as described in the docs; see the sketch below.
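The underlying reason is that in HDP 3.x Spark and Hive keep separate metastore catalogs, so spark.catalog only sees Spark's own catalog; Hive managed tables have to be read through the HWC API instead. Below is a minimal sketch of what that looks like in PySpark, assuming the HWC assembly jar and pyspark_hwc zip are passed to spark-submit exactly as in the question's command. The JDBC URL, metastore URI, database and table names are placeholders to be replaced with values from your own hive-site.xml:

# Minimal HWC sketch (placeholder values throughout -- take the real ones
# from your cluster's hive-site.xml / Ambari configuration).
from pyspark.sql import SparkSession
from pyspark_llap import HiveWarehouseSession   # shipped in pyspark_hwc-*.zip

spark = (
    SparkSession
    .builder
    .master('yarn')
    .appName("HWC example")
    # HiveServer2 Interactive (LLAP) JDBC URL -- placeholder
    .config("spark.sql.hive.hiveserver2.jdbc.url", "jdbc:hive2://llap-host:10500/")
    # Hive metastore URI -- placeholder
    .config("spark.datasource.hive.warehouse.metastoreUri", "thrift://metastore-host:9083")
    .getOrCreate()
)

# Query Hive through the HWC session instead of spark.catalog
hive = HiveWarehouseSession.session(spark).build()
hive.showDatabases().show()
hive.setDatabase("my_db")                                    # placeholder database
hive.executeQuery("SELECT * FROM my_table LIMIT 10").show()  # placeholder table

Depending on the cluster, additional HWC settings (LLAP daemon hosts, ZooKeeper quorum, etc.) may also be required; the --jars and --py-files arguments from the spark-submit command above still need to be supplied.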