Tags: apache-spark, pyspark, apache-spark-sql

Spark Catalog doesn't see the database that I created


I have been learning Spark (3.5.0) and I tried out the following exercise:

  1. Start a Spark session locally:
import pyspark.sql

spark = pyspark.sql.SparkSession\
        .builder\
        .master("local")\
        .appName("hello-spark")\
        .getOrCreate()
  2. Use the spark.catalog API to list all databases with spark.catalog.listDatabases(). I see this:

[Database(name='default', catalog='spark_catalog', description='default database', locationUri='file:/Users/samarth/Desktop/projects/groKPy/sparkey/Coursework/spark-warehouse')]

  3. I create a database my_db using spark.sql('create database my_db') and create a table using
spark.sql("""create table my_db.fire_service_calls_tbl (CallNumber integer, UnitID string, IncidentNumber integer)
          using parquet""")
  4. Listing databases again with spark.catalog.listDatabases(), I now see two databases:

[Database(name='default', catalog='spark_catalog', description='default database', locationUri='file:/Users/samarth/Desktop/projects/groKPy/sparkey/Coursework/spark-warehouse'),

Database(name='my_db', catalog='spark_catalog', description='', locationUri='file:/Users/samarth/Desktop/projects/groKPy/sparkey/Coursework/spark-warehouse/my_db.db')]

I can also see the new my_db.db folder, containing the tables I created, inside the spark-warehouse directory on disk.

However, the next day, when I start a new Spark session and run spark.catalog.listDatabases(), Spark can NOT find my_db at all! It just shows the default database:

[Database(name='default', catalog='spark_catalog', description='default database', locationUri='file:/Users/samarth/Desktop/projects/groKPy/sparkey/Coursework/spark-warehouse')]

Running spark.catalog.databaseExists('my_db') returns False

I am sure I am missing some fundamental point about how this works. I was expecting my_db to show up, but it did not.
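
The next-day check is nothing more than the builder from step 1 plus the same catalog calls, roughly:

import pyspark.sql

# A new day, a new Python process: build the session exactly as in step 1
spark = pyspark.sql.SparkSession\
        .builder\
        .master("local")\
        .appName("hello-spark")\
        .getOrCreate()

spark.catalog.listDatabases()           # only the default database shows up
spark.catalog.databaseExists('my_db')   # False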


Solution

  • I've been having the exact same problem. From what I can tell, by default spark.catalog is backed by an in-memory catalog, so the metadata only lives for the lifetime of the session and is never persisted to disk (the spark-warehouse folder only holds the data files, not the metadata). This obviously isn't what you want. YMMV, but I found that adding some configuration to the session will get the behavior you want. Here's my config:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder \
                        .config("spark.sql.warehouse.dir", "path/to/your/catalog") \
                        .enableHiveSupport() \
                        .appName("your-app-name") \
                        .getOrCreate()
    

    I believe it is necessary to do both of these things:

    1. specify where the warehouse directory is. This can be any path, just someplace where you want the actual databases and tables to be persisted.
    2. enable hive support

    Doing this causes a metastore_db directory to be created when you actually create a db. spark.catalog will then read the information in there when you ask it to list databases / tables or check whether a db / table exists (there's a rough sketch of the full round trip below).

    One caveat is that I'm doing this on my local computer, not using any cloud or managed services.
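
    Here's a rough sketch of the whole round trip. The warehouse path and app name are just placeholders, and the second block stands in for a brand-new Python process the next day:

    from pyspark.sql import SparkSession

    builder = (SparkSession.builder
               .master("local")
               .config("spark.sql.warehouse.dir", "./spark-warehouse")  # placeholder path
               .enableHiveSupport()
               .appName("persistence-check"))                           # placeholder name

    # First session: create the database
    spark = builder.getOrCreate()
    spark.conf.get("spark.sql.catalogImplementation")   # 'hive' here; the default without enableHiveSupport is 'in-memory'
    spark.sql("create database if not exists my_db")
    spark.stop()

    # Later session (in practice, a new Python process the next day) built with the same config
    spark = builder.getOrCreate()
    spark.catalog.databaseExists("my_db")                # True: the metadata now lives in the metastore_db directory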