Tags: pyspark, hive, spark-2.4.4

spark not downloading hive_metastore jars


Environment

I am using Spark v2.4.4 via the Python API.

Problem

According to the Spark documentation, I can force Spark to download all the Hive jars needed to interact with my hive_metastore by setting the following config:

  • spark.sql.hive.metastore.version=${my_version}
  • spark.sql.hive.metastore.jars=maven

However, when I run the following Python code, no jar files are downloaded from Maven.

   from pyspark.sql import SparkSession
   from pyspark import SparkConf

   conf = (
       SparkConf()
       .setAppName("myapp")
       # the Hive metastore version I want to talk to
       .set("spark.sql.hive.metastore.version", "2.3.3")
       # tell Spark to download the matching Hive jars from Maven
       .set("spark.sql.hive.metastore.jars", "maven")
   )
   spark = (
       SparkSession
       .builder
       .config(conf=conf)
       .enableHiveSupport()
       .getOrCreate()
   )

How do I know that no jar files are downloaded?

  1. I have configured logLevel=INFO as a default by setting log4j.logger.org.apache.spark.api.python.PythonGatewayServer=INFO in $SPARK_HOME/conf/log4j.properties. I see no logging that says Spark is interacting with Maven, and according to this I should see an INFO-level log message.
  2. Even if my logging were somehow broken, the SparkSession object builds far too quickly to be pulling large jars from Maven; it returns in under 5 seconds. If I manually add the Maven coordinates of the Hive metastore jars to "spark.jars.packages" (see the sketch after this list), it takes minutes to download everything.
  3. I have deleted the ~/.ivy2 and ~/.m2 directories to rule out caching of previous downloads.
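
For reference, the manual workaround from point 2 looks roughly like the sketch below. The exact Maven artifacts are illustrative only; you would pick the coordinates matching your Hive version:

   from pyspark.sql import SparkSession
   from pyspark import SparkConf

   conf = (
       SparkConf()
       .setAppName("myapp")
       # illustrative coordinates -- adjust to the Hive version you need
       .set(
           "spark.jars.packages",
           "org.apache.hive:hive-metastore:2.3.3,org.apache.hive:hive-exec:2.3.3",
       )
   )
   # this is the slow path: the jars really are resolved and downloaded at startup
   spark = (
       SparkSession
       .builder
       .config(conf=conf)
       .enableHiveSupport()
       .getOrCreate()
   )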

Other tests

  • I have also tried the same code on a Spark 3.0.0 cluster and it doesn't work there either.
  • Can anyone spot what I'm doing wrong? Or is this option just broken?

Solution

  • For anyone else trying to solve this:

    • The download from Maven doesn't happen when you create the Spark context. It happens when you run a Hive command, e.g. spark.catalog.listDatabases().
    • You need to ensure that the Hive version you are trying to run is supported by your Spark version. Not all Hive versions are supported, and different Spark versions support different Hive versions.
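
A minimal sketch of that behaviour, reusing the configuration from the question (and assuming Hive 2.3.3 is supported by your Spark build): the session builds quickly, and the Maven download only starts on the first metastore call.

   from pyspark.sql import SparkSession
   from pyspark import SparkConf

   conf = (
       SparkConf()
       .setAppName("myapp")
       .set("spark.sql.hive.metastore.version", "2.3.3")  # must be a version your Spark release supports
       .set("spark.sql.hive.metastore.jars", "maven")
   )
   spark = (
       SparkSession
       .builder
       .config(conf=conf)
       .enableHiveSupport()
       .getOrCreate()   # returns quickly -- no jars are fetched yet
   )

   # The metastore client is created lazily, so the Maven download
   # (and the corresponding INFO logging) only starts on the first Hive call:
   spark.catalog.listDatabases()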