Tags: apache-spark, pyspark, apache-spark-sql

Spark Catalog doesn't see the database that I created


I have been learning Spark (3.5.0) and I tried out the following exercise:

  1. Start a Spark session locally:
import pyspark.sql

spark = pyspark.sql.SparkSession\
        .builder\
        .master("local")\
        .appName("hello-spark")\
        .getOrCreate()
  2. Use the spark.catalog API to list all databases with spark.catalog.listDatabases(). I see this:

[Database(name='default', catalog='spark_catalog', description='default database', locationUri='file:/Users/samarth/Desktop/projects/groKPy/sparkey/Coursework/spark-warehouse')]

  3. I create a database my_db using spark.sql('create database my_db') and create a table using
spark.sql("""create table my_db.fire_service_calls_tbl (CallNumber integer, UnitID string, IncidentNumber integer)
          using parquet""")
  4. Listing databases again with spark.catalog.listDatabases(), I now see two databases:

[Database(name='default', catalog='spark_catalog', description='default database', locationUri='file:/Users/samarth/Desktop/projects/groKPy/sparkey/Coursework/spark-warehouse'),

Database(name='my_db', catalog='spark_catalog', description='', locationUri='file:/Users/samarth/Desktop/projects/groKPy/sparkey/Coursework/spark-warehouse/my_db.db')]

I can also see the new my_db.db folder, containing the tables I created, inside the spark-warehouse directory on disk.

However, the next day, when I start a new Spark session and run spark.catalog.listDatabases(), Spark can NOT find my_db at all! It just shows the default database:

[Database(name='default', catalog='spark_catalog', description='default database', locationUri='file:/Users/samarth/Desktop/projects/groKPy/sparkey/Coursework/spark-warehouse')]

Running spark.catalog.databaseExists('my_db') returns False

I am sure I am missing some fundamental point about how this works. I was expecting my_db to show up, but it did not.
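
The next-day check is nothing more than the builder from step 1 plus the same catalog calls, roughly:

import pyspark.sql

# A new day, a new Python process: build the session exactly as in step 1
spark = pyspark.sql.SparkSession\
        .builder\
        .master("local")\
        .appName("hello-spark")\
        .getOrCreate()

spark.catalog.listDatabases()           # only the default database shows up
spark.catalog.databaseExists('my_db')   # False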


Solution

  • I've been having the exact same problem. From what I can tell, by default spark.catalog is backed by an in-memory catalog, so the metadata only lives for the lifetime of the session and is never persisted to disk (the spark-warehouse folder only holds the data files, not the metadata). This obviously isn't what you want. YMMV, but I found that adding some configuration to the session will get the behavior you want. Here's my config:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder \
                        .config("spark.sql.warehouse.dir", "path/to/your/catalog") \
                        .enableHiveSupport() \
                        .appName("your-app-name") \
                        .getOrCreate()
    

    I believe it is necessary to do both of these things:

    1. specify where the warehouse directory is. This can be any path, just someplace where you want the actual databases and tables to be persisted.
    2. enable hive support

    Doing this causes a metastore_db directory to be created when you actually create a db. spark.catalog will then read the information in there when you ask it to list databases / tables or check whether a db / table exists (there's a rough sketch of the full round trip below).

    One caveat is that I'm doing this on my local computer, not using any cloud or managed services.
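
    Here's a rough sketch of the whole round trip. The warehouse path and app name are just placeholders, and the second block stands in for a brand-new Python process the next day:

    from pyspark.sql import SparkSession

    builder = (SparkSession.builder
               .master("local")
               .config("spark.sql.warehouse.dir", "./spark-warehouse")  # placeholder path
               .enableHiveSupport()
               .appName("persistence-check"))                           # placeholder name

    # First session: create the database
    spark = builder.getOrCreate()
    spark.conf.get("spark.sql.catalogImplementation")   # 'hive' here; the default without enableHiveSupport is 'in-memory'
    spark.sql("create database if not exists my_db")
    spark.stop()

    # Later session (in practice, a new Python process the next day) built with the same config
    spark = builder.getOrCreate()
    spark.catalog.databaseExists("my_db")                # True: the metadata now lives in the metastore_db directory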